Read failure when adding node + move; Or: What is the right way to add a node?
Initial state: 3 nodes, RF=3, version 0.7.8, some queries at CL=QUORUM.
1. Add a node with the correct token for 4 nodes, repair.
2. Move the first node to balance 4 nodes, repair.
3. Move the second node === start getting timeouts. Hector warning: WARNING - Error: me.prettyprint.hector.api.exceptions.HUnavailableException: May not be enough replicas present to handle consistency level. What is going on? My traffic isn't high. None of my nodes' logs show ANYTHING during the move.
4. When the node finishes moving, the timeouts stop happening.
Is there some state in the above scenario in which I don't have the required replication of at least 2?
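For reference, with the RandomPartitioner the balanced token for node i in an N-node ring is i * 2**127 / N; a minimal sketch of computing the targets for the add and the two moves above (assuming the default RandomPartitioner token space):

# Balanced RandomPartitioner tokens for an N-node ring (token space 0..2**127).
def balanced_tokens(node_count):
    return [i * (2 ** 127 // node_count) for i in range(node_count)]

for i, token in enumerate(balanced_tokens(4)):
    print("node %d -> token %d" % (i, token))  # targets for initial_token / nodetool move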
Re: deleted counters keep their value?
The reason why counters work is that addition is commutative, i.e. x + y = y + x, but deletes are not commutative, i.e. x + delete ≠ delete + x, so the result depends on the order in which the messages arrive. 2011/9/21 Radim Kolar h...@sendmail.cz On 21.9.2011 12:07, aaron morton wrote: see technical limitations for deleting counters http://wiki.apache.org/cassandra/Counters For instance, if you very quickly issue the sequence increment, remove, increment, it is possible for the removal to be lost (if for some reason the remove happens to be the last received message). But I do not remove them very quickly; it does that even with 60 seconds between delete and increment. I do not understand what "remove happens to be the last received message" means.
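To make the ordering argument concrete, a toy simulation (plain Python, not Cassandra code) showing that the same set of counter operations gives different results depending on which message arrives last:

# Apply the same counter operations in two different arrival orders.
def apply_ops(ops):
    value = None  # None means the counter has been deleted / never written
    for op, amount in ops:
        if op == "add":
            value = (value or 0) + amount
        elif op == "delete":
            value = None
    return value

increments_then_delete = [("add", 1), ("add", 1), ("delete", 0)]
delete_then_increments = [("delete", 0), ("add", 1), ("add", 1)]
print(apply_ops(increments_then_delete))  # None: the delete arrived last and wins
print(apply_ops(delete_then_increments))  # 2: same operations, different order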
What causes dropped messages?
How can I tell what's causing dropped messages? Is it just too much activity? I'm not getting any other, more specific messages, just these: WARN [ScheduledTasks:1] 2011-08-15 11:33:26,136 MessagingService.java (line 504) Dropped 1534 MUTATION messages in the last 5000ms WARN [ScheduledTasks:1] 2011-08-15 11:33:26,137 MessagingService.java (line 504) Dropped 58 READ_REPAIR messages in the last 5000ms
Re: Changing the CLI, not a great idea!
This is part of a much bigger problem, one which has many parts, among them: 1. Cassandra is complex. Getting a gestalt understanding of it makes me think I understand how Alzheimer's patients must feel. 2. There is no official documentation. Perhaps everything is out there somewhere, who knows? 3. Cassandra is a moving target. Books are out of date before they hit the press. 4. Most of the important knowledge about Cassandra exists in a kind of oral history that is hard to keep up with, and even harder to understand once it's long past. I think it is clear that we need a better one-stop-shop for good documentation. What hasn't been talked about much - but I think it's just as important - is a good one-stop-shop for Cassandra's oral history. (You might think this list is the place, but it's too noisy to be useful, except at the very tip of the cowcatcher. Cassandra needs a canonized version of its oral history.) On Thu, Jul 28, 2011 at 7:24 AM, Edward Capriolo edlinuxg...@gmail.com wrote: On Thu, Jul 28, 2011 at 12:01 AM, Jonathan Ellis jbel...@gmail.com wrote: On Wed, Jul 27, 2011 at 10:53 PM, Edward Capriolo edlinuxg...@gmail.com wrote: You cannot even put two statements on the same line. So the ';' is semi-useless syntax. Nobody ever asked for that, but lots of people asked to allow statements spanning multiple lines. Is there a way to move things forward without hurting backwards compatibility of the CLI? Yes. Create a new one based on CQL but leave the old one around. -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com On a semi-related note: how can you update a column family and add an index? [default@app] create column family people; 4e3310c0-b8d1-11e0--242d50cf1f9f Waiting for schema agreement... ... schemas agree across the cluster [default@app] update column family people with column_metadata = [{ column_name : ascii(inserted_at), validation_class : LongType , index_type : 0 , index_name : ins_idx}]; org.apache.cassandra.db.marshal.MarshalException: cannot parse 'FUNCTION_CALL' as hex bytes [default@app] update column family people with column_metadata = [{ column_name : inserted_at, validation_class : LongType , index_type : 0 , index_name : ins_idx}]; org.apache.cassandra.db.marshal.MarshalException: cannot parse 'inserted_at' as hex bytes Edward
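One workaround for the second MarshalException above, assuming the column family was created with the default BytesType comparator (so the CLI parses column names as hex bytes), is to supply the name in hex; another is to create the CF with a UTF8Type comparator so plain names parse directly. A quick way to get the hex form:

# Hex-encode a column name for a CLI that is expecting raw (hex) bytes.
import binascii
print(binascii.hexlify(b"inserted_at"))  # -> 696e7365727465645f6174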
Re: CompositeType for row Keys
Why do you need another CF? Is there something wrong with repeating the key as a column and indexing it? On Fri, Jul 22, 2011 at 7:40 PM, Patrick Julien pjul...@gmail.com wrote: Exactly. In any case, I just answered my own question. If I need range, I can just make another column family where the column names are these keys. On Fri, Jul 22, 2011 at 12:37 PM, Nate McCall n...@datastax.com wrote: yes, but why would you use CompositeType if you don't need range queries? If you were doing composite keys anyway (a common approach with time series data, for example), you would not have to write parsing and concatenation code. Particularly useful if you had mixed types in the key.
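For reference, the parsing and concatenation code Nate is referring to is usually just a delimiter convention like the following (the delimiter and field names here are illustrative, not anything Cassandra-defined):

# Hand-rolled composite row key for time-series-style data: <source id>:<day bucket>.
def make_key(source_id, day):
    return "%s:%s" % (source_id, day)

def parse_key(key):
    source_id, day = key.split(":", 1)
    return source_id, day

key = make_key("sensor42", "2011-07-22")
print(key)             # sensor42:2011-07-22
print(parse_key(key))  # ('sensor42', '2011-07-22')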
Re: Repair taking a long, long time
I have this problem too, and I don't understand why. I can repair my nodes very quickly by looping through all my data (when you read your data it does read-repair), but nodetool repair takes forever. I understand that nodetool repair builds merkle trees, etc. etc., so it's a different algorithm, but why can't nodetool repair be smart enough to choose the best algorithm? Also, I don't understand what's special about my data that makes nodetool repair so much slower than looping through all my data. On Wed, Jul 20, 2011 at 12:18 AM, Maxim Potekhin potek...@bnl.gov wrote: Thanks Edward. I'm told by our IT that the switch connecting the nodes is pretty fast. Seriously, in my house I copy complete DVD images from my bedroom to the living room downstairs via WiFi, and a dozen GB does not seem like a problem, on dirt cheap hardware (Patriot Box Office). I also have just _one_ major column family but caveat emptor -- 8 indexes attached to it (and there will be more). There is one accounting CF which is small, can't possibly make a difference. By contrast, compaction (as in nodetool) performs quite well on this cluster. I start suspecting some sort of malfunction. Looked at the system log during the repair, there is some compaction agent doing work that I'm not sure makes sense (and I didn't call for it). Disk utilization all of a sudden goes up to 40% per Ganglia, and stays there, which is pretty silly considering the cluster is IDLE and we have SSDs. No external writes, no reads. There are occasional GC stoppages, but these I can live with. This repair debacle happens for the 2nd time in a row. Cr@p. I need to go to production soon and that doesn't look good at all. If I can't manage a system that simple (and/or get help on this list) I may have to cut losses, i.e. stay with Oracle. Regards, Maxim On 7/19/2011 12:16 PM, Edward Capriolo wrote: Well most SSDs are pretty fast. There is one more thing to consider. If Cassandra determines nodes are out of sync it has to transfer data across the network. If that is the case you have to look at 'nodetool streams' and determine how much data is being transferred between nodes. There are some open tickets where with larger tables repair is streaming more than it needs to. But even if the transfers are only 10% of your 200GB, transferring 20 GB is not trivial. If you have multiple keyspaces and column families, repairing them one at a time might make the process more manageable.
Re: Repair taking a long, long time
As I indicated below (but didn't say specifically) another option is to set read repair chance to 1.0 for all your CFs and loop over all your data, since read triggers a read repair. On Wed, Jul 20, 2011 at 4:58 PM, Maxim Potekhin potek...@bnl.gov wrote: ** I can re-load all data that I have in the cluster, from a flat-file cache I have on NFS, many times faster than the nodetool repair takes. And that's not even accurate because as other noted nodetool repair eats up disk space for breakfast and takes more than 24hrs on 200GB data load, at which point I have to cancel. That's not acceptable. I simply don't know what to do now. On 7/20/2011 8:47 AM, David Boxenhorn wrote: I have this problem too, and I don't understand why. I can repair my nodes very quickly by looping though all my data (when you read your data it does read-repair), but nodetool repair takes forever. I understand that nodetool repair builds merkle trees, etc. etc., so it's a different algorithm, but why can't nodetool repair be smart enough to choose the best algorithm? Also, I don't understand what's special about my data that makes nodetool repair so much slower than looping through all my data. On Wed, Jul 20, 2011 at 12:18 AM, Maxim Potekhin potek...@bnl.gov wrote: Thanks Edward. I'm told by our IT that the switch connecting the nodes is pretty fast. Seriously, in my house I copy complete DVD images from my bedroom to the living room downstairs via WiFi, and a dozen of GB does not seem like a problem, on dirt cheap hardware (Patriot Box Office). I also have just _one_ column major family but caveat emptor -- 8 indexes attached to it (and there will be more). There is one accounting CF which is small, can't possibly make a difference. By contrast, compaction (as in nodetool) performs quite well on this cluster. I start suspecting some sort of malfunction. Looked at the system log during the repair, there is some compaction agent doing work that I'm not sure makes sense (and I didn't call for it). Disk utilization all of a sudden goes up to 40% per Ganglia, and stays there, this is pretty silly considering the cluster is IDLE and we have SSDs. No external writes, no reads. There are occasional GC stoppages, but these I can live with. This repair debacle happens 2nd time in a row. Cr@p. I need to go to production soon and that doesn't look good at all. If I can't manage a system that simple (and/or get help on this list) I may have to cut losses i.e. stay with Oracle. Regards, Maxim On 7/19/2011 12:16 PM, Edward Capriolo wrote: Well most SSD's are pretty fast. There is one more to consider. If Cassandra determines nodes are out of sync it has to transfer data across the network. If that is the case you have to look at 'nodetool streams' and determine how much data is being transferred between nodes. There are some open tickets where with larger tables repair is streaming more then it needs to. But even if the transfers are only 10% of your 200GB. Transferring 20 GB is not trivial. If you have multiple keyspaces and column families repair one at a time might make the process more manageable.
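A rough sketch of the "loop over all your data" approach, using the pycassa client purely as an example (keyspace and column family names are placeholders, and read_repair_chance should already be set to 1.0 on the CF as described above):

# Touch every row so that the reads themselves trigger read repair (pycassa example).
import pycassa

pool = pycassa.ConnectionPool("MyKeyspace", server_list=["localhost:9160"])
cf = pycassa.ColumnFamily(pool, "MyColumnFamily")

rows = 0
for key, columns in cf.get_range():  # reading each row is what triggers read repair
    rows += 1
print("touched %d rows" % rows)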
Re: Default behavior of generate index_name for columns...
I have lots of indexes on columns with the same name. Why don't I have this problem? For example: Keyspace: City: Replication Strategy: org.apache.cassandra.locator.SimpleStrategy Replication Factor: 3 Column Families: ColumnFamily: AttractionCheckins Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type Row cache size / save period: 0.0/0 Key cache size / save period: 0.1/14400 Memtable thresholds: 0.3/64/60 GC grace seconds: 864000 Compaction min/max thresholds: 4/64 Read repair chance: 0.01 Column Metadata: Column Name: 09partition (09partition) Validation Class: org.apache.cassandra.db.marshal.UTF8Type Index Type: KEYS ColumnFamily: Attractions Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type Row cache size / save period: 3.0/14400 Key cache size / save period: 3.0/14400 Memtable thresholds: 0.3/64/60 GC grace seconds: 864000 Compaction min/max thresholds: 4/64 Read repair chance: 0.01 Column Metadata: Column Name: 09partition (09partition) Validation Class: org.apache.cassandra.db.marshal.UTF8Type Index Type: KEYS ColumnFamily: CityResources Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type Row cache size / save period: 5000.0/14400 Key cache size / save period: 5000.0/14400 Memtable thresholds: 0.3/64/60 GC grace seconds: 864000 Compaction min/max thresholds: 4/64 Read repair chance: 0.01 Column Metadata: Column Name: 09partition (09partition) Validation Class: org.apache.cassandra.db.marshal.UTF8Type Index Type: KEYS On Mon, Jul 18, 2011 at 8:20 AM, Boris Yen yulin...@gmail.com wrote: Will this have any side effect when doing a get_indexed_slices or when a user wants to drop an index by any means? Boris On Mon, Jul 18, 2011 at 1:13 PM, Jonathan Ellis jbel...@gmail.com wrote: 0.8.0 didn't check for name conflicts correctly. 0.8.1 does, but it can't fix the ones 0.8.0 allowed, retroactively. On Sun, Jul 17, 2011 at 11:52 PM, Boris Yen yulin...@gmail.com wrote: I have tested another case, not sure if this is a bug. I created a few column families on 0.8.0 each has user_name column, in addition, I also enabled secondary index on this column. Then, I upgraded to 0.8.1, when I used cassandra-cli: show keyspaces, I saw index name user_name_idx appears for different columns families. It seems the validation rule for index_name on 0.8.1 has been skipped completely. Is this a bug? or is it intentional? Regards Boris On Sat, Jul 16, 2011 at 10:38 AM, Boris Yen yulin...@gmail.com wrote: Done. It is CASSANDRA-2903. On Sat, Jul 16, 2011 at 9:44 AM, Jonathan Ellis jbel...@gmail.com wrote: Please. On Fri, Jul 15, 2011 at 7:29 PM, Boris Yen yulin...@gmail.com wrote: Hi Jonathan, Do I need to open a ticket for this? Regards Boris On Sat, Jul 16, 2011 at 6:29 AM, Jonathan Ellis jbel...@gmail.com wrote: Sounds reasonable to me. On Fri, Jul 15, 2011 at 2:55 AM, Boris Yen yulin...@gmail.com wrote: Hi, I have a few column families, each has a column called user_name. I tried to use secondary index on user_name column for each of the column family. However, when creating these column families, cassandra keeps reporting Duplicate index name... exception. I finally figured out that it seems the default index name is column name+_idx, this make my column family violate the uniqueness of index name rule. I was wondering if the default index_name generating rule could be like column name+cf name, so the index name would not collide with each other that easily, if the user do not assign index_name when creating a column family. 
Regards Boris -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
Re: Default behavior of generate index_name for columns...
Ah, that's it. I'm on 0.7 On Mon, Jul 18, 2011 at 1:27 PM, Boris Yen yulin...@gmail.com wrote: which version of cassandra do you use? What I mentioned here only happens on 0.8.1. On Mon, Jul 18, 2011 at 4:44 PM, David Boxenhorn da...@citypath.comwrote: I have lots of indexes on columns with the same name. Why don't I have this problem? For example: Keyspace: City: Replication Strategy: org.apache.cassandra.locator.SimpleStrategy Replication Factor: 3 Column Families: ColumnFamily: AttractionCheckins Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type Row cache size / save period: 0.0/0 Key cache size / save period: 0.1/14400 Memtable thresholds: 0.3/64/60 GC grace seconds: 864000 Compaction min/max thresholds: 4/64 Read repair chance: 0.01 Column Metadata: Column Name: 09partition (09partition) Validation Class: org.apache.cassandra.db.marshal.UTF8Type Index Type: KEYS ColumnFamily: Attractions Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type Row cache size / save period: 3.0/14400 Key cache size / save period: 3.0/14400 Memtable thresholds: 0.3/64/60 GC grace seconds: 864000 Compaction min/max thresholds: 4/64 Read repair chance: 0.01 Column Metadata: Column Name: 09partition (09partition) Validation Class: org.apache.cassandra.db.marshal.UTF8Type Index Type: KEYS ColumnFamily: CityResources Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type Row cache size / save period: 5000.0/14400 Key cache size / save period: 5000.0/14400 Memtable thresholds: 0.3/64/60 GC grace seconds: 864000 Compaction min/max thresholds: 4/64 Read repair chance: 0.01 Column Metadata: Column Name: 09partition (09partition) Validation Class: org.apache.cassandra.db.marshal.UTF8Type Index Type: KEYS On Mon, Jul 18, 2011 at 8:20 AM, Boris Yen yulin...@gmail.com wrote: Will this have any side effect when doing a get_indexed_slices or when a user wants to drop an index by any means? Boris On Mon, Jul 18, 2011 at 1:13 PM, Jonathan Ellis jbel...@gmail.comwrote: 0.8.0 didn't check for name conflicts correctly. 0.8.1 does, but it can't fix the ones 0.8.0 allowed, retroactively. On Sun, Jul 17, 2011 at 11:52 PM, Boris Yen yulin...@gmail.com wrote: I have tested another case, not sure if this is a bug. I created a few column families on 0.8.0 each has user_name column, in addition, I also enabled secondary index on this column. Then, I upgraded to 0.8.1, when I used cassandra-cli: show keyspaces, I saw index name user_name_idx appears for different columns families. It seems the validation rule for index_name on 0.8.1 has been skipped completely. Is this a bug? or is it intentional? Regards Boris On Sat, Jul 16, 2011 at 10:38 AM, Boris Yen yulin...@gmail.com wrote: Done. It is CASSANDRA-2903. On Sat, Jul 16, 2011 at 9:44 AM, Jonathan Ellis jbel...@gmail.com wrote: Please. On Fri, Jul 15, 2011 at 7:29 PM, Boris Yen yulin...@gmail.com wrote: Hi Jonathan, Do I need to open a ticket for this? Regards Boris On Sat, Jul 16, 2011 at 6:29 AM, Jonathan Ellis jbel...@gmail.com wrote: Sounds reasonable to me. On Fri, Jul 15, 2011 at 2:55 AM, Boris Yen yulin...@gmail.com wrote: Hi, I have a few column families, each has a column called user_name. I tried to use secondary index on user_name column for each of the column family. However, when creating these column families, cassandra keeps reporting Duplicate index name... exception. I finally figured out that it seems the default index name is column name+_idx, this make my column family violate the uniqueness of index name rule. 
I was wondering if the default index_name generating rule could be something like column name + cf name, so that index names would not collide with each other so easily if the user does not assign an index_name when creating a column family. Regards Boris -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
Re: Default behavior of generate index_name for columns...
It would be nice if this were fixed before I move up to 0.8... On Mon, Jul 18, 2011 at 3:19 PM, Boris Yen yulin...@gmail.com wrote: If it would not cause the dev team too much trouble, I think Cassandra should maintain backward compatibility regarding the generation of the default index_name; otherwise, when people start dropping column indices, the result might not be what they want. On Mon, Jul 18, 2011 at 7:59 PM, Jonathan Ellis jbel...@gmail.com wrote: On Mon, Jul 18, 2011 at 12:20 AM, Boris Yen yulin...@gmail.com wrote: Will this have any side effect when doing a get_indexed_slices? No. Or when a user wants to drop an index by any means? Sort of; one of the indexes with the name will be dropped, but not all. -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
Why do Digest Queries return hash instead of timestamp?
I just saw this http://wiki.apache.org/cassandra/DigestQueries and I was wondering why it returns a hash of the data. Wouldn't it be better and easier to return the timestamp? You don't really care what the data is, you only care whether it is more or less recent than another piece of data.
Re: Why do Digest Queries return hash instead of timestamp?
If you have two pieces of data that are different but have the same timestamp, how can you resolve consistency? This is a pathological situation to begin with; why should you waste effort to (not) solve it? On Wed, Jul 13, 2011 at 12:05 PM, Boris Yen yulin...@gmail.com wrote: I guess it is because the timestamp does not guarantee data consistency, but hash does. Boris On Wed, Jul 13, 2011 at 4:27 PM, David Boxenhorn da...@citypath.com wrote: I just saw this http://wiki.apache.org/cassandra/DigestQueries and I was wondering why it returns a hash of the data. Wouldn't it be better and easier to return the timestamp? You don't really care what the data is, you only care whether it is more or less recent than another piece of data.
Re: Why do Digest Queries return hash instead of timestamp?
How would you know which data is correct, if they both have the same timestamp? On Wed, Jul 13, 2011 at 12:40 PM, Boris Yen yulin...@gmail.com wrote: I can only say that data does matter; that is why the developers use a hash instead of the timestamp. If the hash value coming from another node is not a match, a read repair is performed so that the correct data can be returned. On Wed, Jul 13, 2011 at 5:08 PM, David Boxenhorn da...@citypath.com wrote: If you have two pieces of data that are different but have the same timestamp, how can you resolve consistency? This is a pathological situation to begin with; why should you waste effort to (not) solve it? On Wed, Jul 13, 2011 at 12:05 PM, Boris Yen yulin...@gmail.com wrote: I guess it is because the timestamp does not guarantee data consistency, but hash does. Boris On Wed, Jul 13, 2011 at 4:27 PM, David Boxenhorn da...@citypath.com wrote: I just saw this http://wiki.apache.org/cassandra/DigestQueries and I was wondering why it returns a hash of the data. Wouldn't it be better and easier to return the timestamp? You don't really care what the data is, you only care whether it is more or less recent than another piece of data.
Re: Why do Digest Queries return hash instead of timestamp?
Is that the actual reason? This seems like a big inefficiency to me. For those of us who don't worry about this extreme edge case (that probably will NEVER happen in real life, for most applications), is there a way to turn this off? Or am I wrong about this making the operation MUCH more expensive? On Wed, Jul 13, 2011 at 3:20 PM, Boris Yen yulin...@gmail.com wrote: For a specific column, if there are two versions with the same timestamp, the value of the column is used to break the tie: if v1.value().compareTo(v2.value()) < 0, it means that v2 wins. On Wed, Jul 13, 2011 at 7:13 PM, David Boxenhorn da...@citypath.com wrote: How would you know which data is correct, if they both have the same timestamp? On Wed, Jul 13, 2011 at 12:40 PM, Boris Yen yulin...@gmail.com wrote: I can only say that data does matter; that is why the developers use a hash instead of the timestamp. If the hash value coming from another node is not a match, a read repair is performed so that the correct data can be returned. On Wed, Jul 13, 2011 at 5:08 PM, David Boxenhorn da...@citypath.com wrote: If you have two pieces of data that are different but have the same timestamp, how can you resolve consistency? This is a pathological situation to begin with; why should you waste effort to (not) solve it? On Wed, Jul 13, 2011 at 12:05 PM, Boris Yen yulin...@gmail.com wrote: I guess it is because the timestamp does not guarantee data consistency, but hash does. Boris On Wed, Jul 13, 2011 at 4:27 PM, David Boxenhorn da...@citypath.com wrote: I just saw this http://wiki.apache.org/cassandra/DigestQueries and I was wondering why it returns a hash of the data. Wouldn't it be better and easier to return the timestamp? You don't really care what the data is, you only care whether it is more or less recent than another piece of data.
Re: Why do Digest Queries return hash instead of timestamp?
Got it. Thanks! On Wed, Jul 13, 2011 at 6:05 PM, Jonathan Ellis jbel...@gmail.com wrote: (1) the hash calculation is a small amount of CPU -- MD5 is specifically designed to be efficient in this kind of situation (2) we compute one hash per query, so for multiple columns the advantage over timestamp-per-column gets large quickly. On Wed, Jul 13, 2011 at 7:31 AM, David Boxenhorn da...@citypath.com wrote: Is that the actual reason? This seems like a big inefficiency to me. For those of us who don't worry about this extreme edge case (that probably will NEVER happen in real life, for most applications), is there a way to turn this off? Or am I wrong about this making the operation MUCH more expensive? On Wed, Jul 13, 2011 at 3:20 PM, Boris Yen yulin...@gmail.com wrote: For a specific column, If there are two versions with the same timestamp, the value of the column is used to break the tie. if v1.value().compareTo(v2.value()) 0, it means that v2 wins. On Wed, Jul 13, 2011 at 7:13 PM, David Boxenhorn da...@citypath.com wrote: How would you know which data is correct, if they both have the same timestamp? On Wed, Jul 13, 2011 at 12:40 PM, Boris Yen yulin...@gmail.com wrote: I can only say, data does matter, that is why the developers use hash instead of timestamp. If hash value comes from other node is not a match, a read repair would perform. so that correct data can be returned. On Wed, Jul 13, 2011 at 5:08 PM, David Boxenhorn da...@citypath.com wrote: If you have to pieces of data that are different but have the same timestamp, how can you resolve consistency? This is a pathological situation to begin with, why should you waste effort to (not) solve it? On Wed, Jul 13, 2011 at 12:05 PM, Boris Yen yulin...@gmail.com wrote: I guess it is because the timestamp does not guarantee data consistency, but hash does. Boris On Wed, Jul 13, 2011 at 4:27 PM, David Boxenhorn da...@citypath.com wrote: I just saw this http://wiki.apache.org/cassandra/DigestQueries and I was wondering why it returns a hash of the data. Wouldn't it be better and easier to return the timestamp? You don't really care what the data is, you only care whether it is more or less recent than another piece of data. -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
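A small illustration of point (2) in plain Python (not the actual Cassandra code): one MD5 digest covers the whole query result in a fixed 16 bytes, whereas shipping a timestamp per column grows linearly with the number of columns read.

# One digest for the whole result vs. one 8-byte timestamp per column.
import hashlib, struct

columns = {"col%04d" % i: ("value%d" % i, 1310000000000000 + i) for i in range(1000)}

digest = hashlib.md5()
for name in sorted(columns):
    value, timestamp = columns[name]
    digest.update(name.encode())
    digest.update(value.encode())
    digest.update(struct.pack(">q", timestamp))

print(len(digest.digest()))  # 16 bytes, no matter how many columns were read
print(8 * len(columns))      # 8000 bytes if every column shipped its own timestamp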
Re: Questions about Cassandra reads
"What do you think?" "I think you should strongly consider denormalizing so that you can read ranges from a single row instead." Why do you recommend denormalizing instead of secondary indexes?
Re: Questions about Cassandra reads
Ah, I get it. Your normal access pattern should be one row at a time. On Sun, Jul 3, 2011 at 11:41 AM, David Boxenhorn da...@citypath.com wrote: What do you think ? I think you should strongly consider denormalizing so that you can read ranges from a single row instead. Why do you recommend denormalizing instead of secondary indexes?
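To make "read ranges from a single row" concrete, a toy sketch of the denormalized layout (pycassa again purely as an example client; the column family, which would need a LongType comparator, and the names used are invented): all of one user's events sit in a single wide row ordered by timestamp, so a column slice on that row replaces the secondary-index query.

# Time-ordered events denormalized into one wide row per user (pycassa example).
import time
import pycassa

pool = pycassa.ConnectionPool("MyKeyspace", server_list=["localhost:9160"])
events = pycassa.ColumnFamily(pool, "UserEvents")  # assumes comparator = LongType

now = int(time.time() * 1e6)
events.insert("user42", {now: "logged_in"})

# One range read over one row: everything this user did in the last day.
last_day = events.get("user42", column_start=now - 86400 * 10 ** 6, column_finish=now)
print(last_day)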
Re: Truncate introspection
Does drop work in a similar way? When I drop a CF and add it back with a different schema, it seems to work. But I notice that in between the drop and adding it back, when the CLI tells me the CF doesn't exist, the old data is still there. I've been assuming that this works, but just wanted to make sure... On Tue, Jun 28, 2011 at 12:56 AM, Jonathan Ellis jbel...@gmail.com wrote: Each node (independently) has logic that guarantees that any writes processed before the truncate, will be wiped out. This does not mean that each node will wipe out the same data, or even that each node will process the truncate (which would result in a timedoutexception). It also does not mean you can't have writes immediately after the truncate that would race w/ a truncate, check for zero sstables procedure. On Mon, Jun 27, 2011 at 3:35 PM, Ethan Rowe et...@the-rowes.com wrote: If those went to zero, it would certainly tell me something happened. :) I guess watching that would be a way of seeing something was going on. Is the truncate itself propagating a ring-wide marker or anything so the CF is logically empty before being physically removed? That's the impression I got from the docs but it wasn't totally clear to me. On Mon, Jun 27, 2011 at 3:33 PM, Jonathan Ellis jbel...@gmail.com wrote: There's a JMX method to get the number of sstables in a CF, is that what you're looking for? On Mon, Jun 27, 2011 at 1:04 PM, Ethan Rowe et...@the-rowes.com wrote: Is there any straightforward means of seeing what's going on after issuing a truncate (on 0.7.5)? I'm not seeing evidence that anything actually happened. I've disabled read repair on the column family in question and don't have anything actively reading/writing at present, apart from my one-off tests to see if rows have disappeared. Thanks in advance. -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
Re: 99.999% uptime - Operations Best Practices?
I think very high uptime, and very low data loss is achievable in Cassandra, but, for new users there are TONS of gotchas. You really have to know what you're doing, and I doubt that many people acquire that knowledge without making a lot of mistakes. I see above that most people are talking about configuration issues. But, the first thing that you will probably do, before you have any experience with Cassandra(!), is architect your system. Architecture is not easily changed when you bump into a gotcha, and for some reason you really have to search the literature well to find out about them. So, my contributions: The too many CFs problem. Cassandra doesn't do well with many column families. If you come from a relational world, a real application can easily have hundreds of tables. Even if you combine them into entities (which is the Cassandra way), you can easily end up with dozens of entities. The most natural thing for someone with a relational background is have one CF per entity, plus indexes according to your needs. Don't do it. You need to store multiple entities in the same CF. Group them together according to access patterns (i.e. when you use X, you probably also need Y), and distinguish them by adding a prefix to their keys (e.g. entityName@key). Don't use supercolumns, use composite columns. Supercolumns are disfavored by the Cassandra community and are slowly being orphaned. For example, secondary indexes don't work on supercolumns. Nor does CQL. Bugs crop up with supercolumns that don't happen with regular columns because internally there's a huge separate code base for supercolumns, and every new feature is designed and implemented for regular columns and then retrofitted for supercolumns (or not). There should really be a database of gotchas somewhere, and how they were solved... On Thu, Jun 23, 2011 at 6:57 AM, Les Hazlewood l...@katasoft.com wrote: Edward, Thank you so much for this reply - this is great stuff, and I really appreciate it. You'll be happy to know that I've already pre-ordered your book. I'm looking forward to it! (When is the ship date?) Best regards, Les On Wed, Jun 22, 2011 at 7:03 PM, Edward Capriolo edlinuxg...@gmail.com wrote: On Wed, Jun 22, 2011 at 8:31 PM, Les Hazlewood l...@katasoft.com wrote: Hi Thoku, You were able to more concisely represent my intentions (and their reasoning) in this thread than I was able to do so myself. Thanks! On Wed, Jun 22, 2011 at 5:14 PM, Thoku Hansen tho...@gmail.com wrote: I think that Les's question was reasonable. Why *not* ask the community for the 'gotchas'? Whether the info is already documented or not, it could be an opportunity to improve the documentation based on users' perception. The you just have to learn responses are fair also, but that reminds me of the days when running Oracle was a black art, and accumulated wisdom made DBAs irreplaceable. Yes, this was my initial concern. I know that Cassandra is still young, and I expect this to be the norm for a while, but I was hoping to make that process a bit easier (for me and anyone else reading this thread in the future). Some recommendations *are* documented, but they are dispersed / stale / contradictory / or counter-intuitive. Others have not been documented in the wiki nor in DataStax's doco, and are instead learned anecdotally or The Hard Way. For example, whether documented or not, some of the 'gotchas' that I encountered when I first started working with Cassandra were: * Don't use OpenJDK. Prefer the Sun JDK. (Wiki says this, Jira says that). 
* It's not viable to run without JNA installed. * Disable swap memory. * Need to run nodetool repair on a regular basis. I'm looking forward to Edward Capriolo's Cassandra book which Les will probably find helpful. Thanks for linking to this. I'm pre-ordering right away. And thanks for the pointers, they are exactly the kind of enumerated things I was looking to elicit. These are the kinds of things that are hard to track down in a single place. I think it'd be nice for the community to contribute this stuff to a single page ('best practices', 'checklist', whatever you want to call it). It would certainly make things easier when getting started. Thanks again, Les Since I got a plug on the book I will chip in again to the thread :) Some things that were mentioned already: Install JNA absolutely (without JNA the snapshot command has to fork to hard link the sstables; I have seen clients back off from this). Also the performance-focused Cassandra devs always try to squeeze out performance by utilizing more native features. OpenJDK vs Sun: I agree, almost always try to do what 'most others' do in production, this way you get surprised less. Other stuff: RAID. You might want to go RAID 1+0 if you are aiming for uptime. RAID 0 has better performance, but if you lose a node your capacity is diminished, rebuilding and rejoining a node involves more
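A trivial sketch of the key-prefix convention described earlier in this thread for storing several logical entities in one CF (the entity names here are invented):

# Distinguish entities sharing one column family by prefixing their row keys.
def entity_key(entity_name, natural_key):
    return "%s@%s" % (entity_name, natural_key)

print(entity_key("attraction", "eiffel-tower"))    # attraction@eiffel-tower
print(entity_key("checkin", "user42/2011-06-23"))  # checkin@user42/2011-06-23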
Re: Replication-aware compaction
Thanks! I'm actually on vacation now, so I hope to look into this next week. On Mon, Jun 6, 2011 at 10:25 PM, aaron morton aa...@thelastpickle.com wrote: You should consider upgrading to 0.7.6 to get a fix to Gossip. Earlier 0.7 releases were prone to marking nodes up and down when they should not have been. See https://github.com/apache/cassandra/blob/cassandra-0.7/CHANGES.txt#L22 Are the TimedOutExceptions to the client for read or write requests ? During the burst times which stages are backing up nodetool tpstats ? Compaction should not affect writes too much (assuming different log and data spindles). You could also take a look at the read and write latency stats for a particular CF using nodetool cfstats or JConsole. These will give you the stats for the local operations. You could also take a look at the iostats on the box http://spyced.blogspot.com/2010/01/linux-performance-basics.html Hope that helps. - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 7 Jun 2011, at 00:30, David Boxenhorn wrote: Version 0.7.3. Yes, I am talking about minor compactions. I have three nodes, RF=3. 3G data (before replication). Not many users (yet). It seems like 3 nodes should be plenty. But when all 3 nodes are compacting, I sometimes get timeouts on the client, and I see in my logs that each one is full of notifications that the other nodes have died (and come back to life after about a second). My cluster can tolerate one node being out of commission, so I would rather have longer compactions one at a time than shorter compactions all at the same time. I think that our usage pattern of bursty writes causes the three nodes to decide to compact at the same time. These bursts are followed by periods of relative quiet, so there should be time for the other two nodes to compact one at a time. On Mon, Jun 6, 2011 at 3:27 PM, David Boxenhorn da...@citypath.com wrote: Version 0.7.3. Yes, I am talking about minor compactions. I have three nodes, RF=3. 3G data (before replication). Not many users (yet). It seems like 3 nodes should be plenty. But when all 3 nodes are compacting, I sometimes get timeouts on the client, and I see in my logs that each one is full of notifications that the other nodes have died (and come back to life after about a second). My cluster can tolerate one node being out of commission, so I would rather have longer compactions one at a time than shorter compactions all at the same time. I think that our usage pattern of bursty writes causes the three nodes to decide to compact at the same time. These bursts are followed by periods of relative quiet, so there should be time for the other two nodes to compact one at a time. On Mon, Jun 6, 2011 at 2:36 PM, aaron morton aa...@thelastpickle.com wrote: Are you talking about minor (automatic) compactions ? Can you provide some more information on what's happening to make the node unusable and what version you are using? It's not lightweight process, but it should not hurt the node that badly. It is considered an online operation. Delaying compaction will only make it run for longer and take more resources. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 6 Jun 2011, at 20:14, David Boxenhorn wrote: Is there some deep architectural reason why compaction can't be replication-aware? What I mean is, if one node is doing compaction, its replicas shouldn't be doing compaction at the same time. Or, at least a quorum of nodes should be available at all times. 
For example, if RF=3, and one node is doing compaction, the nodes to its right and left in the ring should wait on compaction until that node is done. Of course, my real problem is that compaction makes a node pretty much unavailable. If we can fix that problem then this is not necessary.
Replication-aware compaction
Is there some deep architectural reason why compaction can't be replication-aware? What I mean is, if one node is doing compaction, its replicas shouldn't be doing compaction at the same time. Or, at least a quorum of nodes should be available at all times. For example, if RF=3, and one node is doing compaction, the nodes to its right and left in the ring should wait on compaction until that node is done. Of course, my real problem is that compaction makes a node pretty much unavailable. If we can fix that problem then this is not necessary.
Re: [SPAM] Re: slow insertion rate with secondary index
Is there really a 10x difference between indexed CFs and non-indexed CFs? On Mon, Jun 6, 2011 at 11:05 AM, Donal Zang zan...@ihep.ac.cn wrote: On 06/06/2011 05:38, Jonathan Ellis wrote: Index updates require read-before-write (to find out what the prior version was, if any, and update the index accordingly). This is random i/o. Index creation on the other hand is a lot of sequential i/o, hence more efficient. So, the classic bulk load advice to ingest data prior to creating indexes applies. Thanks for the explanation! -- Donal Zang Computing Center, IHEP 19B YuquanLu, Shijingshan District,Beijing, 100049 zan...@ihep.ac.cn 86 010 8823 6018
Re: [SPAM] Re: slow insertion rate with secondary index
Jonathan, are Donal Zang's results (10x slowdown) typical? On Mon, Jun 6, 2011 at 3:14 PM, Jonathan Ellis jbel...@gmail.com wrote: On Mon, Jun 6, 2011 at 6:28 AM, Donal Zang zan...@ihep.ac.cn wrote: Another thing I noticed is: if you first do the insertion, then build the secondary index with 'update column family ...', and then do a select based on the index, the result is not right (it seems the index is still being built even though the update command returns quickly). That is correct. 'describe keyspace' from the CLI tells you when an index has finished building. -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
Re: Replication-aware compaction
Version 0.7.3. Yes, I am talking about minor compactions. I have three nodes, RF=3. 3G data (before replication). Not many users (yet). It seems like 3 nodes should be plenty. But when all 3 nodes are compacting, I sometimes get timeouts on the client, and I see in my logs that each one is full of notifications that the other nodes have died (and come back to life after about a second). My cluster can tolerate one node being out of commission, so I would rather have longer compactions one at a time than shorter compactions all at the same time. I think that our usage pattern of bursty writes causes the three nodes to decide to compact at the same time. These bursts are followed by periods of relative quiet, so there should be time for the other two nodes to compact one at a time. On Mon, Jun 6, 2011 at 3:27 PM, David Boxenhorn da...@citypath.com wrote: Version 0.7.3. Yes, I am talking about minor compactions. I have three nodes, RF=3. 3G data (before replication). Not many users (yet). It seems like 3 nodes should be plenty. But when all 3 nodes are compacting, I sometimes get timeouts on the client, and I see in my logs that each one is full of notifications that the other nodes have died (and come back to life after about a second). My cluster can tolerate one node being out of commission, so I would rather have longer compactions one at a time than shorter compactions all at the same time. I think that our usage pattern of bursty writes causes the three nodes to decide to compact at the same time. These bursts are followed by periods of relative quiet, so there should be time for the other two nodes to compact one at a time. On Mon, Jun 6, 2011 at 2:36 PM, aaron morton aa...@thelastpickle.com wrote: Are you talking about minor (automatic) compactions ? Can you provide some more information on what's happening to make the node unusable and what version you are using? It's not lightweight process, but it should not hurt the node that badly. It is considered an online operation. Delaying compaction will only make it run for longer and take more resources. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 6 Jun 2011, at 20:14, David Boxenhorn wrote: Is there some deep architectural reason why compaction can't be replication-aware? What I mean is, if one node is doing compaction, its replicas shouldn't be doing compaction at the same time. Or, at least a quorum of nodes should be available at all times. For example, if RF=3, and one node is doing compaction, the nodes to its right and left in the ring should wait on compaction until that node is done. Of course, my real problem is that compaction makes a node pretty much unavailable. If we can fix that problem then this is not necessary.
CQL: Select for multiple ranges
In order to fully implement the functionality of super columns using compound columns I need to be able to select multiple column ranges - this would be functionally equivalent to selecting multiple super columns (and more!). I would like to request the following CQL syntax: SELECT [FIRST N] [REVERSED] name1..nameN1, name2..nameN2... FROM ... I am heading into my weekend here. If no one has created a JIRA ticket for this by Sunday, and I am not talked out of it, I will create one myself.
Re: Using composite column names in the CLI
This is what I'm talking about: https://issues.apache.org/jira/browse/CASSANDRA-2231 The on-disk format is <(short) length><constituent><end byte = 0><(short) length><constituent><end byte = 0>... I would like to be able to input these kinds of keys into the CLI, something like set cf[key]['constituent1':'constituent2':'constituent3'] = val On Tue, May 17, 2011 at 2:15 AM, Sameer Farooqui cassandral...@gmail.com wrote: Cassandra wouldn't know that the column name is a composite of two different things. So you could just request the column names and values for a specific key like this and then just look at the column names that get returned: [default@MyKeyspace] get DemoCF[ascii('key_42')]; = (column=CA_SanJose, value=50, timestamp=1305236885112000) = (column=CA_PaloAlto, value=49, timestamp=1305236885192000) = (column=FL_Orlando, value=45, timestamp=130523688528) = (column=NY_NYC, value=40, timestamp=1305236885361000) And I'm not sure what you mean by inputting composite column names. You just input them like any other column name: [default@MyKeyspace] set DemoCF['key_42']['CA_SanJose']='51'; Value inserted. On Mon, May 16, 2011 at 2:34 PM, Aaron Morton aa...@thelastpickle.com wrote: What do you mean by composite column names? Do the data type functions supported by get and set help? Or the assume statement? Aaron On 17/05/2011, at 3:21 AM, David Boxenhorn da...@taotown.com wrote: Is there a way to view composite column names in the CLI? Is there a way to input them (i.e. in the set command)?
Re: Using composite column names in the CLI
Excellent! (I presume there is some way of representing :, like \:?) On Tue, May 17, 2011 at 11:44 AM, Sylvain Lebresne sylv...@datastax.comwrote: Provided you're working on a branch that has CASSANDRA-2231 applied (that's either the cassandra-0.8.1 branch or trunk), this work 'out of the box': The setup will look like: [default@unknown] create keyspace test; [default@unknown] use test; [default@test] create column family testCF with comparator='CompositeType(AsciiType, IntegerType(reversed=true), IntegerType)' and default_validation_class=AsciiType; Then: [default@test] set testCF[a]['foo:24:24'] = 'v1'; Value inserted. [default@test] set testCF[a]['foo:42:24'] = 'v2'; Value inserted. [default@test] set testCF[a]['foobar:42:24'] = 'v3'; Value inserted. [default@test] set testCF[a]['boobar:42:24'] = 'v4'; Value inserted. [default@test] set testCF[a]['boobar:42:42'] = 'v5'; Value inserted. [default@test] get testCF[a]; = (column=boobar:42:24, value=v4, timestamp=1305621115813000) = (column=boobar:42:42, value=v5, timestamp=1305621125563000) = (column=foo:42:24, value=v2, timestamp=1305621096473000) = (column=foo:24:24, value=v1, timestamp=1305621085548000) = (column=foobar:42:24, value=v3, timestamp=1305621110813000) Returned 5 results. -- Sylvain On Tue, May 17, 2011 at 9:20 AM, David Boxenhorn da...@taotown.com wrote: This is what I'm talking about https://issues.apache.org/jira/browse/CASSANDRA-2231 The on-disk format is (short)lengthconstituentend byte = 0(short)lengthconstituentend byte = 0... I would like to be able to input these kinds of keys into the CLI, something like set cf[key]['constituent1':'constituent2':'constituent3'] = val On Tue, May 17, 2011 at 2:15 AM, Sameer Farooqui cassandral...@gmail.com wrote: Cassandra wouldn't know that the column name is composite of two different things. So you could just request the column names and values for a specific key like this and then just look at the column names that get returned: [default@MyKeyspace] get DemoCF[ascii('key_42')]; = (column=CA_SanJose, value=50, timestamp=1305236885112000) = (column=CA_PaloAlto, value=49, timestamp=1305236885192000) = (column=FL_Orlando, value=45, timestamp=130523688528) = (column=NY_NYC, value=40, timestamp=1305236885361000) And I'm not sure what you mean by inputting composite column names. You just input them like any other column name: [default@MyKeyspace] set DemoCF['key_42']['CA_SanJose']='51'; Value inserted. On Mon, May 16, 2011 at 2:34 PM, Aaron Morton aa...@thelastpickle.com wrote: What do you mean by composite column names? Do the data type functions supported by get and set help? Or the assume statement? Aaron On 17/05/2011, at 3:21 AM, David Boxenhorn da...@taotown.com wrote: Is there a way to view composite column names in the CLI? Is there a way to input them (i.e. in the set command)?
Re: Import/Export of Schema Migrations
What you describe below sounds like what I want to do. I think that the only additional thing I am requesting is to export the migrations from the dev cluster (since Cassandra already has a table that saves them - I just want that information!) so I can import it to the other clusters. This would ensure that my migrations are exactly right, without being dependent on error-prone human intervention. To really get rid of human intervention it would be nice to be able to mark a certain migration with a version name. Then I could say something like, export migrations version1.2.3 to version1.2.4 and I would get the exact migration path from one version to another. On Mon, May 16, 2011 at 1:04 AM, aaron morton aa...@thelastpickle.comwrote: personal preference Not sure what sort of changes you are making, but this is my approach. I've always managed database (my sql, sql server whatever) schema as source code (SQL DDL statements, CLI script etc). It makes it a lot easier to cold start the system, test changes and see who changed what. Once you have your initial schema you can hand roll a CLI script to update / drop existing CF's. For the update column family statement all the attributes are delta to the current setting, i.e. you do not need to say comparator is ascii again. Other than the indexes, you need to specify all the indexes again those not included will be dropped. If you want to be able to replay multiple schema changes made during dev against other clusters my personal approach would be: - create a cli script for every change (using update and delete CF), prefixed with 000X so you can see the order. - manage the scripts in source control - sanity check to see if they can be collapsed - replay the changes in order when applying them to a cluster. (you will still need to manually delete data from dropped cf's) changes to conf/cassandra.yaml can be managed using chef /person preference Others will have different ideas - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 14 May 2011, at 00:15, David Boxenhorn wrote: Actually, I want a way to propagate *any* changes from development to staging to production, but schema changes are the most important. Could I use 2221 to propagate schema changes by deleting the schema in the target cluster, doing show schema in the source cluster, redirecting to a file, and running the file as a script in the target cluster? Of course, I would have to delete the files of dropped CFs by hand (something I *hate* to do, because I'm afraid of making a mistake), but it would be a big improvement. I am open to any other ideas of how to propagate changes from one cluster to another in an efficient non-error-prone fashion. Our development environment (i.e. development, staging, production) is pretty standard, so I'm sure that I'm not the only one with this problem! On Fri, May 13, 2011 at 12:51 PM, aaron morton aa...@thelastpickle.comwrote: What sort of schema changes are you making? can you manage them as a CLI script under source control ? You may also be interested in CASSANDRA-2221. Cheers Aaron - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 12 May 2011, at 20:45, David Boxenhorn wrote: My use case is like this: I have a development cluster, a staging cluster and a production cluster. When I finish a set of migrations (i.e. changes) on the development cluster, I want to apply them to the staging cluster, and eventually the production cluster. 
I don't want to do it by hand, because it's a painful and error-prone process. What I would like to do is export the last N migrations from the development cluster as a text file, with exactly the same format as the original text commands, and import them to the staging and production clusters. I think the best place to do this might be the CLI, since you would probably want to view your migrations before exporting them. Something like this: show migrations N;Shows the last N migrations. export migrations N fileName; Exports the last N migrations to file fileName. import migrations fileName; Imports migrations from fileName. The import process would apply the migrations one at a time giving you feedback like, applying migration: update column family If a migration fails, the process should give an appropriate message and stop. Is anyone else interested in this? I have created a Jira ticket for it here: https://issues.apache.org/jira/browse/CASSANDRA-2636
Using composite column names in the CLI
Is there a way to view composite column names in the CLI? Is there a way to input them (i.e. in the set command)?
Re: Import/Export of Schema Migrations
Actually, I want a way to propagate *any* changes from development to staging to production, but schema changes are the most important. Could I use 2221 to propagate schema changes by deleting the schema in the target cluster, doing show schema in the source cluster, redirecting to a file, and running the file as a script in the target cluster? Of course, I would have to delete the files of dropped CFs by hand (something I *hate* to do, because I'm afraid of making a mistake), but it would be a big improvement. I am open to any other ideas of how to propagate changes from one cluster to another in an efficient non-error-prone fashion. Our development environment (i.e. development, staging, production) is pretty standard, so I'm sure that I'm not the only one with this problem! On Fri, May 13, 2011 at 12:51 PM, aaron morton aa...@thelastpickle.comwrote: What sort of schema changes are you making? can you manage them as a CLI script under source control ? You may also be interested in CASSANDRA-2221. Cheers Aaron - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 12 May 2011, at 20:45, David Boxenhorn wrote: My use case is like this: I have a development cluster, a staging cluster and a production cluster. When I finish a set of migrations (i.e. changes) on the development cluster, I want to apply them to the staging cluster, and eventually the production cluster. I don't want to do it by hand, because it's a painful and error-prone process. What I would like to do is export the last N migrations from the development cluster as a text file, with exactly the same format as the original text commands, and import them to the staging and production clusters. I think the best place to do this might be the CLI, since you would probably want to view your migrations before exporting them. Something like this: show migrations N;Shows the last N migrations. export migrations N fileName; Exports the last N migrations to file fileName. import migrations fileName; Imports migrations from fileName. The import process would apply the migrations one at a time giving you feedback like, applying migration: update column family If a migration fails, the process should give an appropriate message and stop. Is anyone else interested in this? I have created a Jira ticket for it here: https://issues.apache.org/jira/browse/CASSANDRA-2636
Import/Export of Schema Migrations
My use case is like this: I have a development cluster, a staging cluster and a production cluster. When I finish a set of migrations (i.e. changes) on the development cluster, I want to apply them to the staging cluster, and eventually the production cluster. I don't want to do it by hand, because it's a painful and error-prone process. What I would like to do is export the last N migrations from the development cluster as a text file, with exactly the same format as the original text commands, and import them to the staging and production clusters. I think the best place to do this might be the CLI, since you would probably want to view your migrations before exporting them. Something like this:
show migrations N; Shows the last N migrations.
export migrations N fileName; Exports the last N migrations to file fileName.
import migrations fileName; Imports migrations from fileName.
The import process would apply the migrations one at a time, giving you feedback like "applying migration: update column family ...". If a migration fails, the process should give an appropriate message and stop. Is anyone else interested in this? I have created a Jira ticket for it here: https://issues.apache.org/jira/browse/CASSANDRA-2636
Re: compaction strategy
"I'm also not too much in favor of triggering major compactions, because it mostly has a nasty effect (create one huge sstable)." If that is the case, why can't major compactions create many, non-overlapping SSTables? In general, it seems to me that non-overlapping SSTables have all the advantages of big SSTables (i.e. you know exactly where the data is) without the disadvantages that come with being big. Why doesn't Cassandra take advantage of that in a major way?
Re: compaction strategy
If they each have their own copy of the data, then they are *not* non-overlapping! If you have non-overlapping SSTables (and you know the min/max keys), it's like having one big SSTable because you know exactly where each row is, and it becomes easy to merge a new SSTable in small batches, rather than in one huge batch. The only step that you have to add to the current merge process is, when you are going to write a new SSTable, if it's too big, to write N (non-overlapping!) pieces instead. On Mon, May 9, 2011 at 12:46 PM, Terje Marthinussen tmarthinus...@gmail.com wrote: Yes, agreed. I actually think cassandra has to. And if you do not go down to that single file, how do you avoid getting into a situation where you can very realistically end up with 4-5 big sstables, each having its own copy of the same data, massively increasing disk requirements? Terje On Mon, May 9, 2011 at 5:58 PM, David Boxenhorn da...@taotown.com wrote: "I'm also not too much in favor of triggering major compactions, because it mostly has a nasty effect (create one huge sstable)." If that is the case, why can't major compactions create many, non-overlapping SSTables? In general, it seems to me that non-overlapping SSTables have all the advantages of big SSTables (i.e. you know exactly where the data is) without the disadvantages that come with being big. Why doesn't Cassandra take advantage of that in a major way?
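A toy sketch of the "write N (non-overlapping!) pieces instead" step in plain Python (real SSTable writing is obviously nothing like this): split one sorted merge output into key-disjoint chunks of bounded size and record each chunk's min/max key.

# Split a sorted run of (key, row) pairs into non-overlapping, bounded pieces.
def split_sorted_run(sorted_rows, max_rows_per_piece):
    piece = []
    for item in sorted_rows:
        piece.append(item)
        if len(piece) >= max_rows_per_piece:
            yield piece
            piece = []
    if piece:
        yield piece

merged = [("key%03d" % i, "row %d" % i) for i in range(10)]
for piece in split_sorted_run(merged, 4):
    print(piece[0][0], "..", piece[-1][0])  # each piece covers a disjoint key range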
Cassandra and JCR
I think this is a question specifically for Patricio Echagüe, though I welcome answers from anyone else who can contribute... We are considering using Magnolia as a CMS. Magnolia uses Jackrabbit for its data storage. Jackrabbit is a JCR implementation. Questions: 1. Can we plug Cassandra into JCR/Jackrabbit as its data storage? 2. I see that some work has already been done on this issue (specifically, I see that Patricio was involved in this). Where does that work stand now? Is this a viable option for us? 3. How much work would it be for us? 4. What are the issues involved?
Cassandra CMS
Does anyone know of a content management system that can be easily customized to use Cassandra as its database? (Even better, if it can use Cassandra without customization!)
Re: Cassandra CMS
I'm looking at Magnolia at the moment (as in, this second). At first glance, it looks like I should be able to use Cassandra as the database: http://documentation.magnolia-cms.com/technical-guide/content-storage-and-structure.html#Persistent_storage If it can use a filesystem as its database, it can use Cassandra, no? On Thu, May 5, 2011 at 2:01 PM, aaron morton aa...@thelastpickle.comwrote: Would you think of Django as a CMS ? http://stackoverflow.com/questions/2369793/how-to-use-cassandra-in-django-framework http://stackoverflow.com/questions/2369793/how-to-use-cassandra-in-django-framework Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 5 May 2011, at 22:54, Eric tamme wrote: Does anyone know of a content management system that can be easily customized to use Cassandra as its database? (Even better, if it can use Cassandra without customization!) I think your best bet will be to look for a CMS that uses an ORM for the storage layer and write a specific ORM for Cassandra that you can plugin to whatever frame work the CMS uses. -Eric
Compound columns spec
Is there a spec for compound columns? I want to know the exact format of compound columns so I can adhere to it. For example, what is the separator - or is some other format used (e.g. length:value or type:length:value)?
Re: Compound columns spec
Thanks, yes, I was referring to the compound columns in this quote (from a previous thread): No, CQL will never support super columns, but later versions (not 1.0.0) will support compound columns. Compound columns are better; instead of a two-deep structure, you can have one of arbitrary depth. I would like to design my keys to take advantage of this future development, when it comes. On Thu, May 5, 2011 at 5:53 PM, Sylvain Lebresne sylv...@datastax.comwrote: I suppose it depends what you are referring to by compound columns. If you're talking about the CompositeType of CASSANDRA-2231 (which is my only guess), then the format is in the javadoc and is: /* * The encoding of a CompositeType column name should be: * <component><component><component> ... * where <component> is: * <length of value><value><'end-of-component' byte> * where the 'end-of-component' byte should always be 0 for actual column * name. However, it can be set to 1 for query bounds. This allows to query for * the equivalent of 'give me the full super-column'. That is, if during a * slice query uses: * start = <3><foo.getBytes()><0> * end = <3><foo.getBytes()><1> * then he will be sure to get *all* the columns whose first component is foo. * If for a component, the 'end-of-component' is != 0, there should not be any * following component. */ I'll mention that this is not committed code yet (but soon hopefully and the format shouldn't change). -- Sylvain On Thu, May 5, 2011 at 4:44 PM, David Boxenhorn da...@taotown.com wrote: Is there a spec for compound columns? I want to know the exact format of compound columns so I can adhere to it. For example, what is the separator - or is some other format used (e.g. length:value or type:length:value)?
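To make the quoted format concrete, here is a minimal encoding sketch (my own illustration, not project code; in particular the 2-byte unsigned-short length prefix is an assumption, since the javadoc excerpt quoted above does not spell out the width of <length of value>):

    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;

    // Illustrative only: encode a composite column name as
    // <length><value><end-of-component byte> per component, assuming a
    // 2-byte unsigned-short length prefix.
    public class CompositeNameSketch {
        public static ByteBuffer encode(byte endOfComponent, String... components) {
            int size = 0;
            for (String c : components) size += 2 + c.getBytes(StandardCharsets.UTF_8).length + 1;
            ByteBuffer bb = ByteBuffer.allocate(size);
            for (int i = 0; i < components.length; i++) {
                byte[] value = components[i].getBytes(StandardCharsets.UTF_8);
                bb.putShort((short) value.length); // <length of value>
                bb.put(value);                     // <value>
                // end-of-component: 0 for actual names; 1 only on the last
                // component of a query bound ("give me the whole super column")
                bb.put(i == components.length - 1 ? endOfComponent : 0);
            }
            bb.flip();
            return bb;
        }

        public static void main(String[] args) {
            ByteBuffer start = encode((byte) 0, "foo"); // slice start
            ByteBuffer end = encode((byte) 1, "foo");   // slice end: everything under "foo"
            System.out.println(start.remaining() + " / " + end.remaining());
        }
    }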
Re: Compound columns spec
What is the format of <length of value>? On Thu, May 5, 2011 at 6:14 PM, Eric Evans eev...@rackspace.com wrote: On Thu, 2011-05-05 at 17:44 +0300, David Boxenhorn wrote: Is there a spec for compound columns? I want to know the exact format of compound columns so I can adhere to it. For example, what is the separator - or is some other format used (e.g. length:value or type:length:value)? Tentatively, CQL will use colon delimited terms like this, yes (tentatively). -- Eric Evans eev...@rackspace.com
One cluster or many?
If I have a database that partitions naturally into non-overlapping datasets, in which there are no references between datasets, where each dataset is quite large (i.e. large enough to merit its own cluster from the point of view of quantity of data), should I set up one cluster per database or one large cluster for everything together? As I see it: The primary advantage of separate clusters is total isolation: if I have a problem with one dataset, my application will continue working normally for all other datasets. The primary advantage of one big cluster is usage pooling: when one server goes down in a large cluster it's much less important than when one server goes down in a small cluster. Also, different temporal usage patterns of the different datasets (i.e. there will be different peak hours on different datasets) can be combined to ease capacity requirements. Any thoughts?
Terrible CQL idea: > and < as aliases of >= and <=
Is this still true? *Note: The greater-than and less-than operators ( > and < ) result in key ranges that are inclusive of the terms. There is no supported notion of “strictly” greater-than or less-than; these operators are merely supported as aliases to >= and <=. * I think that making > and < aliases of >= and <= is a terrible idea! First of all, it is very misleading. Second, what will happen to old code when > and < are really supported? (*Some* day they will be supported!)
Re: Combining all CFs into one big one
I guess I'm still feeling fuzzy on this because my actual use-case isn't so black-and-white. I don't have any CFs that are accessed purely, or even mostly, in once-through batch mode. What I have is CFs with more and less data, and CFs that are accessed more and less frequently. On Mon, May 2, 2011 at 7:52 PM, Tyler Hobbs ty...@datastax.com wrote: On Mon, May 2, 2011 at 5:05 AM, David Boxenhorn da...@taotown.com wrote: Wouldn't it be the case that the once-used rows in your batch process would quickly be traded out of the cache, and replaced by frequently-used rows? Yes, and you'll pay a cache miss penalty for each of the replacements. This would be the case even if your batch process goes on for a long time, since caching is done on a row-by-row basis. In effect, it would mean that part of your cache is taken up by the batch process, much as if you dedicated a permanent cache to the batch - except that it isn't permanent, so it's better! Right, but we didn't want to cache any of the batch CF in the first place, because caching that CF is worth very little. With separate CFs, we could explicitly give it no cache. Now we have no control over how much of the cache it evicts.
Combining all CFs into one big one
I'm having problems administering my cluster because I have too many CFs (~40). I'm thinking of combining them all into one big CF. I would prefix the current CF name to the keys, repeat the CF name in a column, and index the column (so I can loop over all rows, which I have to do sometimes, for some CFs). Can anyone think of any disadvantages to this approach?
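As a concrete illustration of the proposed scheme (a hypothetical helper, not an existing API):

    // Hypothetical helper for the "one big CF" approach: the old CF name is
    // prefixed to the row key and repeated in an indexed column, so rows of a
    // given logical CF can still be enumerated via the secondary index.
    public class CombinedCfSketch {
        static final String SEPARATOR = ":"; // assumes ':' never appears in CF names

        static String combinedKey(String cfName, String originalKey) {
            return cfName + SEPARATOR + originalKey;
        }

        static String cfNameOf(String combinedKey) {
            return combinedKey.substring(0, combinedKey.indexOf(SEPARATOR));
        }

        public static void main(String[] args) {
            String key = combinedKey("Users", "david");
            // Along with the row itself you would write a column such as
            // "cf_name" = "Users" and put a secondary index on it, so that
            // a query on cf_name = 'Users' loops over all Users rows.
            System.out.println(key + " belongs to CF " + cfNameOf(key));
        }
    }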
Re: Combining all CFs into one big one
Shouldn't these kinds of problems be solved by Cassandra? Isn't there a maximum SSTable size? On Sun, May 1, 2011 at 3:24 PM, shimi shim...@gmail.com wrote: Big sstables, long compactions, in major compaction you will need to have free disk space in the size of all the sstables (which you should have anyway). Shimi On Sun, May 1, 2011 at 2:03 PM, David Boxenhorn da...@taotown.com wrote: I'm having problems administering my cluster because I have too many CFs (~40). I'm thinking of combining them all into one big CF. I would prefix the current CF name to the keys, repeat the CF name in a column, and index the column (so I can loop over all rows, which I have to do sometimes, for some CFs). Can anyone think of any disadvantages to this approach?
Re: Combining all CFs into one big one
If you had one big cache, wouldn't it be the case that it's mostly populated with frequently accessed rows, and less populated with rarely accessed rows? In fact, wouldn't one big cache dynamically and automatically give you exactly what you want? If you try to partition the same amount of memory manually, by guesswork, among many tables, aren't you always going to do a worse job? On Sun, May 1, 2011 at 10:43 PM, Tyler Hobbs ty...@datastax.com wrote: On Sun, May 1, 2011 at 2:16 PM, Jake Luciani jak...@gmail.com wrote: On Sun, May 1, 2011 at 2:58 PM, shimi shim...@gmail.com wrote: On Sun, May 1, 2011 at 9:48 PM, Jake Luciani jak...@gmail.com wrote: If you have N column families you need N * memtable size of RAM to support this. If that's not an option you can merge them into one as you suggest but then you will have much larger SSTables, slower compactions, etc. I don't necessarily agree with Tyler that the OS cache will be less effective... But I do agree that if the sizes of sstables are too large for you then more hardware is the solution... If you merge CFs which are hardly accessed with one which are accessed frequently, when you read the SSTable you load data that is hardly accessed to the OS cache. Only the rows or portions of rows you read will be loaded into the OS cache. Just because different rows are in the same file doesn't mean the entire file is loaded into the OS cache. The bloom filter and index file will be loaded but those are not large files. Right -- it does depend on the page size and the average amount of data read. The effect will be more pronounced on CFs with small rows than those with wide rows.
Re: Indexes on heterogeneous rows
Thanks, Jonathan. I think I understand now. To sum up: Everything would work, but if your only equality is on type (all the rest inequalities), it could be very inefficient. Is that right? On Thu, Apr 14, 2011 at 7:22 PM, Jonathan Ellis jbel...@gmail.com wrote: On Thu, Apr 14, 2011 at 6:48 AM, David Boxenhorn da...@taotown.com wrote: The reason why I put type first is that queries on type will always be an exact match, whereas the other clauses might be inequalities. Expression order doesn't matter, but as you imply, non-equalities can't be used in an index lookup and have to be checked in a nested loop phase afterwards. -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
Re: Indexes on heterogeneous rows
Thank you for your answer, and sorry about the sloppy terminology. I'm thinking of the scenario where there are a small number of results in the result set, but there are billions of rows in the first of your secondary indexes. That is, I want to do something like (not sure of the CQL syntax): select * where type=2 and e=5 where there are billions of rows of type 2, but some manageable number of those rows have e=5. As I understand it, secondary indexes are like column families, where each value is a column. So the billions of rows where type=2 would go into a single row of the secondary index. This sounds like a problem to me, is it? I'm assuming that the billions of rows that don't have column e at all (those rows of other types) are not a problem at all... On Thu, Apr 14, 2011 at 12:12 PM, aaron morton aa...@thelastpickle.comwrote: Need to clear up some terminology here. Rows have a key and can be retrieved by key. This is *sort of* the primary index, but not primary in the normal RDBMS sense. Rows can have different columns and the column names are sorted and can be efficiently selected. There are secondary indexes in cassandra 0.7 based on column values http://www.datastax.com/dev/blog/whats-new-cassandra-07-secondary-indexes So you could create secondary indexes on the a,e, and h columns and get rows that have specific values. There are some limitations to secondary indexes, read the linked article. Or you can make your own secondary indexes using row keys as the index values. If you have billions of rows, how many do you need to read back at once? Hope that helps Aaron On 14 Apr 2011, at 04:23, David Boxenhorn wrote: Is it possible in 0.7.x to have indexes on heterogeneous rows, which have different sets of columns? For example, let's say you have three types of objects (1, 2, 3) which each had three members. If your rows had the following pattern type=1 a=? b=? c=? type=2 d=? e=? f=? type=3 g=? h=? i=? could you index type as your primary index, and also index a, e, h as secondary indexes, to get the objects of that type that you are looking for? Would it work if you had billions of rows of each type?
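To picture the concern: a secondary index (the built-in kind or a hand-rolled one) can be thought of as an extra column family in which each row key is an indexed value and the column names are the keys of the matching data rows. A conceptual sketch, not the literal on-disk layout:

    index on "type":  row "2" -> { key-000001, key-000002, ... billions of columns ... }
    index on "e":     row "5" -> { key-000017, key-004242, ... a manageable number ... }

Under that model every type=2 key lands in one very wide index row (per node, for the built-in local indexes), which is exactly the potential hotspot being asked about.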
Indexes on heterogeneous rows
Is it possible in 0.7.x to have indexes on heterogeneous rows, which have different sets of columns? For example, let's say you have three types of objects (1, 2, 3) which each had three members. If your rows had the following pattern type=1 a=? b=? c=? type=2 d=? e=? f=? type=3 g=? h=? i=? could you index type as your primary index, and also index a, e, h as secondary indexes, to get the objects of that type that you are looking for? Would it work if you had billions of rows of each type?
Re: Double ColumnType and comparing
If you do it, I'd recommend BigDecimal. It's an exact type, and usually what you want. On Mon, Mar 14, 2011 at 3:40 PM, Jonathan Ellis jbel...@gmail.com wrote: We'd be happy to commit a patch contributing a DoubleType. On Sun, Mar 13, 2011 at 7:36 PM, Paul Teasdale teasda...@gmail.com wrote: I am quite new to Cassandra and am trying to model a simple Column Family which uses Doubles as column names: Datalines: { // ColumnFamily dataline-1:{ // row key 23.5: 'someValue', 23.6: 'someValue', ... 4334.99: 'someValue' }, dataline-2:{ 10.5: 'someValue', 12.6: 'someValue', ... 23334.99: 'someValue' }, ... dataline-n:{ 10.5: 'someValue', 12.6: 'someValue', ... 23334.99: 'someValue' } } In declaring this column family, I need to specify a 'CompareWith' attribute for a Double type, but the only available values I found for this attribute are: * BytesType * AsciiType * UTF8Type * LongType * LexicalUUIDType * TimeUUIDType Is there any support anywhere for double values (there has to be something)? And if not, does this mean we need to extend org.apache.cassandra.db.marshal.AbstractType<Double>? package com.mycom.types; class DoubleType extends org.apache.cassandra.db.marshal.AbstractType<Double> { public int compare(ByteBuffer o1, ByteBuffer o2){ // trivial implementation Double d1 = o1.getDouble(0); Double d2 = o2.getDouble(0); return d1.compareTo(d2); } //... } And declare the column family: <ColumnFamily CompareWith="com.mycom.types.DoubleType" Name="Datalines"/> Thanks, Paul -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
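For readers who want the comparison logic by itself, here is a self-contained sketch: a plain java.util.Comparator over 8-byte big-endian doubles. A real Cassandra comparator would extend AbstractType<Double> and implement its other required methods, which are omitted here.

    import java.nio.ByteBuffer;
    import java.util.Comparator;

    // Core of a Double comparator: interpret each column name as an 8-byte
    // big-endian double and compare numerically.
    public class DoubleNameComparator implements Comparator<ByteBuffer> {
        @Override
        public int compare(ByteBuffer o1, ByteBuffer o2) {
            // Empty names sort first, matching the usual Cassandra convention.
            if (o1.remaining() == 0) return o2.remaining() == 0 ? 0 : -1;
            if (o2.remaining() == 0) return 1;
            double d1 = o1.getDouble(o1.position());
            double d2 = o2.getDouble(o2.position());
            return Double.compare(d1, d2);
        }

        public static void main(String[] args) {
            ByteBuffer a = ByteBuffer.allocate(8).putDouble(23.5);
            a.flip();
            ByteBuffer b = ByteBuffer.allocate(8).putDouble(23.6);
            b.flip();
            System.out.println(new DoubleNameComparator().compare(a, b)); // negative
        }
    }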
Re: On 0.6.6 to 0.7.3 migration, DC-aware traffic and minimising data transfer
How do you write to two versions of Cassandra from the same client? Two versions of Hector? On Mon, Mar 14, 2011 at 6:46 PM, Robert Coli rc...@digg.com wrote: On Mon, Mar 14, 2011 at 8:39 AM, Jedd Rashbrooke j...@visualdna.com wrote: But more importantly for us it would mean we'd have just the one major outage, rather than two (relocation and 0.6 - 0.7) Take zero major outages instead? :D a) Set up new cluster on new version. b) Fork application writes, so all writes go to both clusters. c) Backfill old data to new cluster via API writes. d) Flip the switch to read from the new cluster. e) Turn off old cluster. =Rob
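Rob's step (b), forking application writes, might look roughly like the following. The ClusterClient interface is a hypothetical stand-in for whatever client (Hector, raw Thrift) is in use; the point is only that every write goes to both clusters and a failure on the new cluster must not fail the request.

    // Hypothetical dual-write wrapper for a zero-downtime migration: every
    // write goes to the old (authoritative) cluster and is also attempted on
    // the new cluster; reads stay on the old cluster until the switch is flipped.
    public class DualWriteSketch {
        interface ClusterClient { // stand-in for the real client library
            void write(String key, String column, byte[] value);
        }

        private final ClusterClient oldCluster;
        private final ClusterClient newCluster;

        DualWriteSketch(ClusterClient oldCluster, ClusterClient newCluster) {
            this.oldCluster = oldCluster;
            this.newCluster = newCluster;
        }

        void write(String key, String column, byte[] value) {
            oldCluster.write(key, column, value);     // must succeed, as before
            try {
                newCluster.write(key, column, value); // best effort during backfill
            } catch (RuntimeException e) {
                // log and continue; the backfill pass will catch it up
            }
        }
    }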
Re: Nodes frozen in GC
If RF=2 and CL=QUORUM, you're getting no benefit from replication: a quorum of 2 replicas is 2, so every request still needs both nodes. When a node is in GC it stops everything. Set RF=3, so when one node is busy the cluster will still work. On Tue, Mar 8, 2011 at 11:46 AM, ruslan usifov ruslan.usi...@gmail.comwrote: 2011/3/8 Chris Goffinet c...@chrisgoffinet.com How large are your SSTables on disk? My thought was because you have so many on disk, we have to store the bloom filter + every 128 keys from index in memory. 0.5GB But as I understand it, storing in memory happens only when reads happen, and I do only inserts. And I think memory isn't the problem, because heap allocation looks like a saw (at the peak, heap allocations reach about 5.5 GB, then they drop back to 2 GB). Also, when I increase the heap size to 7 GB the situation is much better, but the node freezes still happen, and in gc.log I still see: Total time for which application threads were stopped: 20.0686307 seconds lines (now not as often as before)
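The arithmetic behind that advice, as a quick sketch (quorum is floor(RF/2) + 1):

    // QUORUM = floor(RF / 2) + 1: with RF=2 a quorum is 2, so every QUORUM
    // request still needs both replicas and one node pausing for GC blocks it;
    // with RF=3 a quorum is 2, so one frozen node can be tolerated.
    public class QuorumSketch {
        static int quorum(int replicationFactor) {
            return replicationFactor / 2 + 1;
        }

        public static void main(String[] args) {
            System.out.println("RF=2 -> quorum " + quorum(2)); // 2
            System.out.println("RF=3 -> quorum " + quorum(3)); // 2
        }
    }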
Re: Exceptions on 0.7.0
Shimi, I am getting the same error that you report here. What did you do to solve it? David On Thu, Feb 10, 2011 at 2:54 PM, shimi shim...@gmail.com wrote: I upgraded the version on all the nodes but I still get the Exceptions. I ran cleanup on one of the nodes but I don't think there is any cleanup going on. Another weird thing that I see is: INFO [CompactionExecutor:1] 2011-02-10 12:08:21,353 CompactionIterator.java (line 135) Compacting large row 333531353730363835363237353338383836383035363036393135323132383 73630323034313a446f20322e384c20656e67696e657320686176652061646a75737461626c65206c696674657273 (725849473109 bytes) incrementally In my production version the largest row is 10259. It shouldn't be different in this case. The first Exception is being thrown on 3 nodes during compaction. The second Exception (Internal error processing get_range_slices) is being thrown all the time by a fourth node. I disabled gossip and any client traffic to it and I still get the Exceptions. Is it possible to boot a node with gossip disabled? Shimi On Thu, Feb 10, 2011 at 11:11 AM, aaron morton aa...@thelastpickle.comwrote: I should be able to repair, install the new version and kick off nodetool repair. If you are uncertain search for cassandra-1992 on the list, there has been some discussion. You can also wait till some peeps in the states wake up if you want to be extra sure. The number is the number of columns the iterator is going to return from the row. I'm guessing that because this is happening during compaction it's asking for the maximum possible number of columns. Aaron On 10 Feb 2011, at 21:37, shimi wrote: On 10 Feb 2011, at 13:42, Dan Hendry wrote: Out of curiosity, do you really have on the order of 1,986,622,313 elements (I believe elements=keys) in the cf? Dan No. I was too puzzled by the numbers On Thu, Feb 10, 2011 at 10:30 AM, aaron morton aa...@thelastpickle.com wrote: Shimi, You may be seeing the result of CASSANDRA-1992, are you able to test with the most recent 0.7 build ? https://hudson.apache.org/hudson/job/Cassandra-0.7/ Aaron I will. I hope the data was not corrupted. On Thu, Feb 10, 2011 at 10:30 AM, aaron morton aa...@thelastpickle.comwrote: Shimi, You may be seeing the result of CASSANDRA-1992, are you able to test with the most recent 0.7 build ? https://hudson.apache.org/hudson/job/Cassandra-0.7/ Aaron On 10 Feb 2011, at 13:42, Dan Hendry wrote: Out of curiosity, do you really have on the order of 1,986,622,313 elements (I believe elements=keys) in the cf? Dan *From:* shimi [mailto:shim...@gmail.com] *Sent:* February-09-11 15:06 *To:* user@cassandra.apache.org *Subject:* Exceptions on 0.7.0 I have a 4 node test cluster where I test the port to 0.7.0 from 0.6.X On 3 out of the 4 nodes I get exceptions in the log. I am using RP. Changes that I did: 1. changed the replication factor from 3 to 4 2. configured the nodes to use Dynamic Snitch 3. RR of 0.33 I ran repair on 2 nodes before I noticed the errors. One of them is having the first error and the other the second. I restarted the nodes but I still get the exceptions. The following Exception I get from 2 nodes: WARN [CompactionExecutor:1] 2011-02-09 19:50:51,281 BloomFilter.java (line 84) Cannot provide an optimal Bloom Filter for 1986622313 elements (1/4 buckets per element). 
ERROR [CompactionExecutor:1] 2011-02-09 19:51:10,190 AbstractCassandraDaemon.java (line 91) Fatal exception in thread Thread[CompactionExecutor:1,1,main] java.io.IOError: java.io.EOFException at org.apache.cassandra.io.sstable.SSTableIdentityIterator.next(SSTableIdentityIterator.java:105) at org.apache.cassandra.io.sstable.SSTableIdentityIterator.next(SSTableIdentityIterator.java:34) at org.apache.commons.collections.iterators.CollatingIterator.set(CollatingIterator.java:284) at org.apache.commons.collections.iterators.CollatingIterator.least(CollatingIterator.java:326) at org.apache.commons.collections.iterators.CollatingIterator.next(CollatingIterator.java:230) at org.apache.cassandra.utils.ReducingIterator.computeNext(ReducingIterator.java:68) at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:136) at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:131) at com.google.common.collect.Iterators$7.computeNext(Iterators.java:604) at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:136) at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:131) at org.apache.cassandra.db.ColumnIndexer.serializeInternal(ColumnIndexer.java:76) at org.apache.cassandra.db.ColumnIndexer.serialize(ColumnIndexer.java:50) at org.apache.cassandra.io.LazilyCompactedRow.init(LazilyCompactedRow.java:88) at
Re: Exceptions on 0.7.0
Thanks, Shimi. I'll keep you posted if we make progress. Riptano is working on this problem too. On Tue, Feb 22, 2011 at 3:30 PM, shimi shim...@gmail.com wrote: I didn't solved it. Since it is a test cluster I deleted all the data. I copied some sstables from my production cluster and I tried again, this time I didn't have this problem. I am planing on removing everything from this test cluster. I will start all over again with 0.6.x , then I will load it with 10th of GB of data (not sstable copy) and test the upgrade again. I did a mistake that I didn't backup the data files before I upgraded. Shimi On Tue, Feb 22, 2011 at 2:24 PM, David Boxenhorn da...@lookin2.comwrote: Shimi, I am getting the same error that you report here. What did you do to solve it? David On Thu, Feb 10, 2011 at 2:54 PM, shimi shim...@gmail.com wrote: I upgraded the version on all the nodes but I still gets the Exceptions. I run cleanup on one of the nodes but I don't think there is any cleanup going on. Another weird thing that I see is: INFO [CompactionExecutor:1] 2011-02-10 12:08:21,353 CompactionIterator.java (line 135) Compacting large row 333531353730363835363237353338383836383035363036393135323132383 73630323034313a446f20322e384c20656e67696e657320686176652061646a75737461626c65206c696674657273 (725849473109 bytes) incrementally In my production version the largest row is 10259. It shouldn't be different in this case. The first Exception is been thrown on 3 nodes during compaction. The second Exception (Internal error processing get_range_slices) is been thrown all the time by a forth node. I disabled gossip and any client traffic to it and I still get the Exceptions. Is it possible to boot a node with gossip disable? Shimi On Thu, Feb 10, 2011 at 11:11 AM, aaron morton aa...@thelastpickle.comwrote: I should be able to repair, install the new version and kick off nodetool repair . If you are uncertain search for cassandra-1992 on the list, there has been some discussion. You can also wait till some peeps in the states wake up if you want to be extra sure. The number if the number of columns the iterator is going to return from the row. I'm guessing that because this happening during compaction it's using asked for the maximum possible number of columns. Aaron On 10 Feb 2011, at 21:37, shimi wrote: On 10 Feb 2011, at 13:42, Dan Hendry wrote: Out of curiosity, do you really have on the order of 1,986,622,313 elements (I believe elements=keys) in the cf? Dan No. I was too puzzled by the numbers On Thu, Feb 10, 2011 at 10:30 AM, aaron morton aa...@thelastpickle.com wrote: Shimi, You may be seeing the result of CASSANDRA-1992, are you able to test with the most recent 0.7 build ? https://hudson.apache.org/hudson/job/Cassandra-0.7/ Aaron I will. I hope the data was not corrupted. On Thu, Feb 10, 2011 at 10:30 AM, aaron morton aa...@thelastpickle.com wrote: Shimi, You may be seeing the result of CASSANDRA-1992, are you able to test with the most recent 0.7 build ? https://hudson.apache.org/hudson/job/Cassandra-0.7/ Aaron On 10 Feb 2011, at 13:42, Dan Hendry wrote: Out of curiosity, do you really have on the order of 1,986,622,313 elements (I believe elements=keys) in the cf? Dan *From:* shimi [mailto:shim...@gmail.com] *Sent:* February-09-11 15:06 *To:* user@cassandra.apache.org *Subject:* Exceptions on 0.7.0 I have a 4 node test cluster were I test the port to 0.7.0 from 0.6.X On 3 out of the 4 nodes I get exceptions in the log. I am using RP. Changes that I did: 1. changed the replication factor from 3 to 4 2. 
configured the nodes to use Dynamic Snitch 3. RR of 0.33 I run repair on 2 nodes before I noticed the errors. One of them is having the first error and the other the second. I restart the nodes but I still get the exceptions. The following Exception I get from 2 nodes: WARN [CompactionExecutor:1] 2011-02-09 19:50:51,281 BloomFilter.java (line 84) Cannot provide an optimal Bloom Filter for 1986622313 elements (1/4 buckets per element). ERROR [CompactionExecutor:1] 2011-02-09 19:51:10,190 AbstractCassandraDaemon.java (line 91) Fatal exception in thread Thread[CompactionExecutor:1,1,main] java.io.IOError: java.io.EOFException at org.apache.cassandra.io.sstable.SSTableIdentityIterator.next(SSTableIdentityIterator.java:105) at org.apache.cassandra.io.sstable.SSTableIdentityIterator.next(SSTableIdentityIterator.java:34) at org.apache.commons.collections.iterators.CollatingIterator.set(CollatingIterator.java:284) at org.apache.commons.collections.iterators.CollatingIterator.least(CollatingIterator.java:326) at org.apache.commons.collections.iterators.CollatingIterator.next(CollatingIterator.java:230) at org.apache.cassandra.utils.ReducingIterator.computeNext(ReducingIterator.java:68
Distribution Factor: part of the solution to many-CF problem?
Cassandra is both distributed and replicated. We have Replication Factor but no Distribution Factor! Distribution Factor would define over how many nodes a CF should be distributed. Say you want to support millions of multi-tenant users in clusters with thousands of nodes, where you don't know the user's schema in advance, so you can't have users share CFs. In this case you wouldn't want to spread out each user's Column Families over thousands of nodes! You would want something like: RF=3, DF=10 i.e. distribute each CF over 10 nodes, within those nodes replicate 3 times. One implementation of DF would be to hash the CF name, and use the same strategies defined for RF to choose the N nodes in DF=N.
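One possible shape of the idea, as a sketch (entirely hypothetical; no such setting exists in Cassandra): hash the CF name to pick a starting point on the ring, walk clockwise to collect DF nodes, then place the RF replicas within that subset the way an ordinary placement strategy would.

    import java.util.*;

    // Hypothetical "Distribution Factor" placement: a CF is confined to DF of
    // the cluster's nodes, chosen deterministically from the CF name, and the
    // RF replicas are then picked from within that subset.
    public class DistributionFactorSketch {
        static List<String> nodesForCf(List<String> ringOrderedNodes, String cfName,
                                       int df, int rf) {
            int start = Math.floorMod(cfName.hashCode(), ringOrderedNodes.size());
            List<String> subset = new ArrayList<>();
            for (int i = 0; i < Math.min(df, ringOrderedNodes.size()); i++) {
                subset.add(ringOrderedNodes.get((start + i) % ringOrderedNodes.size()));
            }
            // A per-key placement strategy would now pick RF replicas out of
            // 'subset', much as SimpleStrategy picks them from the whole ring;
            // taking the first RF here is just to keep the sketch short.
            return subset.subList(0, Math.min(rf, subset.size()));
        }

        public static void main(String[] args) {
            List<String> ring = Arrays.asList("n1","n2","n3","n4","n5","n6","n7","n8","n9","n10");
            System.out.println(nodesForCf(ring, "tenant_42_events", 10, 3));
        }
    }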
Re: Distribution Factor: part of the solution to many-CF problem?
No, that's not what I mean at all. That message is about the ability to use different partitioners for different CFs, say, RandomPartitioner for one, OPP for another. I'm talking about defining how many nodes a CF should be distributed over, which would be useful if you have a lot of nodes and a lot of small CFs (small relative to the total amount of data). On Mon, Feb 21, 2011 at 9:58 PM, Aaron Morton aa...@thelastpickle.comwrote: Sounds a bit like this idea http://www.mail-archive.com/dev@cassandra.apache.org/msg01799.html Aaron On 22/02/2011, at 1:28 AM, David Boxenhorn da...@lookin2.com wrote: Cassandra is both distributed and replicated. We have Replication Factor but no Distribution Factor! Distribution Factor would define over how many nodes a CF should be distributed. Say you want to support millions of multi-tenant users in clusters with thousands of nodes, where you don't know the user's schema in advance, so you can't have users share CFs. In this case you wouldn't want to spread out each user's Column Families over thousands of nodes! You would want something like: RF=3, DF=10 i.e. distribute each CF over 10 nodes, within those nodes replicate 3 times. One implementation of DF would be to hash the CF name, and use the same strategies defined for RF to choose the N nodes in DF=N.
Re: Do supercolumns have a purpose?
I agree, that is the way to go. Then each piece of new functionality will not have to be implemented twice. On Sat, Feb 12, 2011 at 9:41 AM, Stu Hood stuh...@gmail.com wrote: I would like to continue to support super columns, but to slowly convert them into compound column names, since that is really all they really are. On Thu, Feb 10, 2011 at 10:16 AM, Frank LoVecchio fr...@isidorey.comwrote: I've found super column families quite useful when using RandomOrderedPartioner on a low-maintenance cluster (as opposed to Byte/Ordered), e.g. returning ordered data from a TimeUUID comparator type; try doing that with one regular column family and secondary indexes (you could obviously sort on the client side, but that is tedious and not logical for older data). On Thu, Feb 10, 2011 at 12:32 AM, David Boxenhorn da...@lookin2.comwrote: Mike, my problem is that I have an database and codebase that already uses supercolumns. If I had to do it over, it wouldn't use them, for the reasons you point out. In fact, I have a feeling that over time supercolumns will become deprecated de facto, if not de jure. That's why I would like to see them represented internally as regular columns, with an upgrade path for backward compatibility. I would love to do it myself! (I haven't looked at the code base, but I don't understand why it should be so hard.) But my employer has other ideas... On Wed, Feb 9, 2011 at 8:14 PM, Mike Malone m...@simplegeo.com wrote: On Tue, Feb 8, 2011 at 2:03 AM, David Boxenhorn da...@lookin2.comwrote: Shaun, I agree with you, but marking them as deprecated is not good enough for me. I can't easily stop using supercolumns. I need an upgrade path. David, Cassandra is open source and community developed. The right thing to do is what's best for the community, which sometimes conflicts with what's best for individual users. Such strife should be minimized, it will never be eliminated. Luckily, because this is an open source, liberal licensed project, if you feel strongly about something you should feel free to add whatever features you want yourself. I'm sure other people in your situation will thank you for it. At a minimum I think it would behoove you to re-read some of the comments here re: why super columns aren't really needed and take another look at your data model and code. I would actually be quite surprised to find a use of super columns that could not be trivially converted to normal columns. In fact, it should be possible to do at the framework/client library layer - you probably wouldn't even need to change any application code. Mike On Tue, Feb 8, 2011 at 3:53 AM, Shaun Cutts sh...@cuttshome.netwrote: I'm a newbie here, but, with apologies for my presumptuousness, I think you should deprecate SuperColumns. They are already distracting you, and as the years go by the cost of supporting them as you add more and more functionality is only likely to get worse. It would be better to concentrate on making the core column families better (and I'm sure we can all think of lots of things we'd like). Just dropping SuperColumns would be bad for your reputation -- and for users like David who are currently using them. But if you mark them clearly as deprecated and explain why and what to do instead (perhaps putting a bit of effort into migration tools... or even a virtual layer supporting arbitrary hierarchical data), then you can drop them in a few years (when you get to 1.0, say), without people feeling betrayed. 
-- Shaun On Feb 6, 2011, at 3:48 AM, David Boxenhorn wrote: My main point was to say that it's think it is better to create tickets for what you want, rather than for something else completely different that would, as a by-product, give you what you want. Then let me say what I want: I want supercolumn families to have any feature that regular column families have. My data model is full of supercolumns. I used them, even though I knew it didn't *have to*, because they were there, which implied to me that I was supposed to use them for some good reason. Now I suspect that they will gradually become less and less functional, as features are added to regular column families and not supported for supercolumn families. On Fri, Feb 4, 2011 at 10:58 AM, Sylvain Lebresne sylv...@datastax.com wrote: On Fri, Feb 4, 2011 at 12:35 AM, Mike Malone m...@simplegeo.comwrote: On Thu, Feb 3, 2011 at 6:44 AM, Sylvain Lebresne sylv...@datastax.com wrote: On Thu, Feb 3, 2011 at 3:00 PM, David Boxenhorn da...@lookin2.com wrote: The advantage would be to enable secondary indexes on supercolumn families. Then I suggest opening a ticket for adding secondary indexes to supercolumn families and voting on it. This will be 1 or 2 order of magnitude less work than getting rid of super column internally, and probably a much better solution anyway. I realize that this is largely
Re: Do supercolumns have a purpose?
Mike, my problem is that I have an database and codebase that already uses supercolumns. If I had to do it over, it wouldn't use them, for the reasons you point out. In fact, I have a feeling that over time supercolumns will become deprecated de facto, if not de jure. That's why I would like to see them represented internally as regular columns, with an upgrade path for backward compatibility. I would love to do it myself! (I haven't looked at the code base, but I don't understand why it should be so hard.) But my employer has other ideas... On Wed, Feb 9, 2011 at 8:14 PM, Mike Malone m...@simplegeo.com wrote: On Tue, Feb 8, 2011 at 2:03 AM, David Boxenhorn da...@lookin2.com wrote: Shaun, I agree with you, but marking them as deprecated is not good enough for me. I can't easily stop using supercolumns. I need an upgrade path. David, Cassandra is open source and community developed. The right thing to do is what's best for the community, which sometimes conflicts with what's best for individual users. Such strife should be minimized, it will never be eliminated. Luckily, because this is an open source, liberal licensed project, if you feel strongly about something you should feel free to add whatever features you want yourself. I'm sure other people in your situation will thank you for it. At a minimum I think it would behoove you to re-read some of the comments here re: why super columns aren't really needed and take another look at your data model and code. I would actually be quite surprised to find a use of super columns that could not be trivially converted to normal columns. In fact, it should be possible to do at the framework/client library layer - you probably wouldn't even need to change any application code. Mike On Tue, Feb 8, 2011 at 3:53 AM, Shaun Cutts sh...@cuttshome.net wrote: I'm a newbie here, but, with apologies for my presumptuousness, I think you should deprecate SuperColumns. They are already distracting you, and as the years go by the cost of supporting them as you add more and more functionality is only likely to get worse. It would be better to concentrate on making the core column families better (and I'm sure we can all think of lots of things we'd like). Just dropping SuperColumns would be bad for your reputation -- and for users like David who are currently using them. But if you mark them clearly as deprecated and explain why and what to do instead (perhaps putting a bit of effort into migration tools... or even a virtual layer supporting arbitrary hierarchical data), then you can drop them in a few years (when you get to 1.0, say), without people feeling betrayed. -- Shaun On Feb 6, 2011, at 3:48 AM, David Boxenhorn wrote: My main point was to say that it's think it is better to create tickets for what you want, rather than for something else completely different that would, as a by-product, give you what you want. Then let me say what I want: I want supercolumn families to have any feature that regular column families have. My data model is full of supercolumns. I used them, even though I knew it didn't *have to*, because they were there, which implied to me that I was supposed to use them for some good reason. Now I suspect that they will gradually become less and less functional, as features are added to regular column families and not supported for supercolumn families. 
On Fri, Feb 4, 2011 at 10:58 AM, Sylvain Lebresne sylv...@datastax.comwrote: On Fri, Feb 4, 2011 at 12:35 AM, Mike Malone m...@simplegeo.comwrote: On Thu, Feb 3, 2011 at 6:44 AM, Sylvain Lebresne sylv...@datastax.com wrote: On Thu, Feb 3, 2011 at 3:00 PM, David Boxenhorn da...@lookin2.comwrote: The advantage would be to enable secondary indexes on supercolumn families. Then I suggest opening a ticket for adding secondary indexes to supercolumn families and voting on it. This will be 1 or 2 order of magnitude less work than getting rid of super column internally, and probably a much better solution anyway. I realize that this is largely subjective, and on such matters code speaks louder than words, but I don't think I agree with you on the issue of which alternative is less work, or even which is a better solution. You are right, I put probably too much emphase in that sentence. My main point was to say that it's think it is better to create tickets for what you want, rather than for something else completely different that would, as a by-product, give you what you want. Then I suspect that *if* the only goal is to get secondary indexes on super columns, then there is a good chance this would be less work than getting rid of super columns. But to be fair, secondary indexes on super columns may not make too much sense without #598, which itself would require quite some work, so clearly I spoke a bit quickly. If the goal is to have a hierarchical model, limiting the depth to two seems arbitrary. Why
Re: Do supercolumns have a purpose?
Shaun, I agree with you, but marking them as deprecated is not good enough for me. I can't easily stop using supercolumns. I need an upgrade path. On Tue, Feb 8, 2011 at 3:53 AM, Shaun Cutts sh...@cuttshome.net wrote: I'm a newbie here, but, with apologies for my presumptuousness, I think you should deprecate SuperColumns. They are already distracting you, and as the years go by the cost of supporting them as you add more and more functionality is only likely to get worse. It would be better to concentrate on making the core column families better (and I'm sure we can all think of lots of things we'd like). Just dropping SuperColumns would be bad for your reputation -- and for users like David who are currently using them. But if you mark them clearly as deprecated and explain why and what to do instead (perhaps putting a bit of effort into migration tools... or even a virtual layer supporting arbitrary hierarchical data), then you can drop them in a few years (when you get to 1.0, say), without people feeling betrayed. -- Shaun On Feb 6, 2011, at 3:48 AM, David Boxenhorn wrote: My main point was to say that it's think it is better to create tickets for what you want, rather than for something else completely different that would, as a by-product, give you what you want. Then let me say what I want: I want supercolumn families to have any feature that regular column families have. My data model is full of supercolumns. I used them, even though I knew it didn't *have to*, because they were there, which implied to me that I was supposed to use them for some good reason. Now I suspect that they will gradually become less and less functional, as features are added to regular column families and not supported for supercolumn families. On Fri, Feb 4, 2011 at 10:58 AM, Sylvain Lebresne sylv...@datastax.comwrote: On Fri, Feb 4, 2011 at 12:35 AM, Mike Malone m...@simplegeo.com wrote: On Thu, Feb 3, 2011 at 6:44 AM, Sylvain Lebresne sylv...@datastax.comwrote: On Thu, Feb 3, 2011 at 3:00 PM, David Boxenhorn da...@lookin2.comwrote: The advantage would be to enable secondary indexes on supercolumn families. Then I suggest opening a ticket for adding secondary indexes to supercolumn families and voting on it. This will be 1 or 2 order of magnitude less work than getting rid of super column internally, and probably a much better solution anyway. I realize that this is largely subjective, and on such matters code speaks louder than words, but I don't think I agree with you on the issue of which alternative is less work, or even which is a better solution. You are right, I put probably too much emphase in that sentence. My main point was to say that it's think it is better to create tickets for what you want, rather than for something else completely different that would, as a by-product, give you what you want. Then I suspect that *if* the only goal is to get secondary indexes on super columns, then there is a good chance this would be less work than getting rid of super columns. But to be fair, secondary indexes on super columns may not make too much sense without #598, which itself would require quite some work, so clearly I spoke a bit quickly. If the goal is to have a hierarchical model, limiting the depth to two seems arbitrary. Why not go all the way and allow an arbitrarily deep hierarchy? If a more sophisticated hierarchical model is deemed unnecessary, or impractical, allowing a depth of two seems inconsistent and unnecessary. 
It's pretty trivial to overlay a hierarchical model on top of the map-of-sorted-maps model that Cassandra implements. Ed Anuff has implemented a custom comparator that does the job [1]. Google's Megastore has a similar architecture and goes even further [2]. It seems to me that super columns are a historical artifact from Cassandra's early life as Facebook's inbox storage system. They needed posting lists of messages, sharded by user. So that's what they built. In my dealings with the Cassandra code, super columns end up making a mess all over the place when algorithms need to be special cased and branch based on the column/supercolumn distinction. I won't even mention what it does to the thrift interface. Actually, I agree with you, more than you know. If I were to start coding Cassandra now, I wouldn't include super columns (and I would probably not go for a depth unlimited hierarchical model either). But it's there and I'm not sure getting rid of them fully (meaning, including in thrift) is an option (it would be a big compatibility breakage). And (even though I certainly though about this more than once :)) I'm slightly less enthusiastic about keeping them in thrift but encoding them in regular column family internally: it would still be a lot of work but we would still probably end up with nasty tricks to stick to the thrift api. -- Sylvain Mike [1] http://www.anuff.com/2010/07
Re: time to live rows
I hope you don't consider this a hijack of the thread... What I'd like to know is the following: The GC removes TTL rows some time after they expire, at its convenience. But will they stop being returned as soon as they expire? (This is the expected behavior...) On Tue, Feb 8, 2011 at 5:11 PM, Kallin Nagelberg kallin.nagelb...@gmail.com wrote: So the empty row will be ultimately removed then? Is there a way for the GC to verify this? Thanks, -Kal On Tue, Feb 8, 2011 at 2:21 AM, Stu Hood stuh...@gmail.com wrote: The expired columns were converted into tombstones, which will live for the GC timeout. The empty row will be cleaned up when those tombstones are removed. Returning the empty row is unfortunate... we'd love to find a more appropriate solution that might not involve endless scanning. See http://wiki.apache.org/cassandra/FAQ#i_deleted_what_gives http://wiki.apache.org/cassandra/FAQ#range_ghosts On Mon, Feb 7, 2011 at 1:49 PM, Kallin Nagelberg kallin.nagelb...@gmail.com wrote: I also tried forcing a major compaction on the column family using JMX but the row remains. On Mon, Feb 7, 2011 at 4:43 PM, Kallin Nagelberg kallin.nagelb...@gmail.com wrote: I tried that but I still see the row coming back on a list columnfamily in the CLI. My concern is that there will be a pointer to an empty row for all eternity. -Kal On Mon, Feb 7, 2011 at 4:38 PM, Aaron Morton aa...@thelastpickle.com wrote: Deleting all the columns in a row via TTL has the same effect as deleting the row; the data will physically be removed during compaction. Aaron On 08 Feb 2011, at 10:24 AM, Bill Speirs bill.spe...@gmail.com wrote: I don't think this is supported (but I could be completely wrong). However, I'd love to see this functionality as well. How would one go about requesting such a feature? Bill- On Mon, Feb 7, 2011 at 4:15 PM, Kallin Nagelberg kallin.nagelb...@gmail.com wrote: Hey, I have read about the new TTL columns in Cassandra 0.7. In my case I'd like to expire an entire row automatically after a certain amount of time. Is this possible as well? Thanks, -Kal
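For what it's worth, the expected behavior can be pictured as a simple read-time check, sketched below (a conceptual illustration, not Cassandra's actual code): a column whose TTL has elapsed stops being returned immediately, even though the tombstone it leaves behind is only purged after the GC grace period.

    // Conceptual sketch: an expiring column stops being returned as soon as
    // its TTL elapses, independently of when compaction/GC physically purges it.
    public class TtlSketch {
        static boolean isLive(long writtenAtMillis, int ttlSeconds, long nowMillis) {
            long expiresAtMillis = writtenAtMillis + ttlSeconds * 1000L;
            return nowMillis < expiresAtMillis;
        }

        public static void main(String[] args) {
            long wrote = System.currentTimeMillis();
            System.out.println(isLive(wrote, 60, wrote + 30_000)); // true: still returned
            System.out.println(isLive(wrote, 60, wrote + 90_000)); // false: filtered from reads
        }
    }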
Re: Using a synchronized counter that keeps track of no of users on the application using it to allot UserIds/ keys to the new users after sign up
Why not synchronize on the client side? Make sure that the process that allocates user ids runs on only a single machine, in a synchronized method, and uses QUORUM for its reads and writes to Cassandra? On Sun, Feb 6, 2011 at 11:02 PM, Aaron Morton aa...@thelastpickle.comwrote: If you mix mysql and Cassandra you risk creating a single point of failure around the mysql system. If you have use data that changes infrequently, a row cache in cassandra will give you fast reads. Aaron On 5/02/2011, at 8:13 AM, Aklin_81 asdk...@gmail.com wrote: Thanks so much Ryan for the links; I'll definitely take them into consideration. Just another thought which came to my mind:- perhaps it may be beneficial to store(or duplicate) some of the data like the Login credentials particularly userId to User's Name mapping, etc (which is very heavily read), in a fast MyISAM table. This could solve the problem of keys though auto-generated unique sequential primary keys. I could use the same keys for Cassandra rows for that user. And also since Cassandra reads are relatively slow, it makes sense to store data like userId to Name mapping in MyISAM as this data would be required after almost all queries to the database. Regards -Asil On Fri, Feb 4, 2011 at 10:14 PM, Ryan King r...@twitter.com wrote: On Thu, Feb 3, 2011 at 9:12 PM, Aklin_81 asdk...@gmail.com wrote: Thanks Matthew Ryan, The main inspiration behind me trying to generate Ids in sequential manner is to reduce the size of the userId, since I am using it for heavy denormalization. UUIDs are 16 bytes long, but I can also have a unique Id in just 4 bytes, and since this is just a one time process when the user signs-up, it makes sense to try cutting down the space requirements, if it is feasible without any downsides(!?). I am also using userIds to attach to Id of the other data of the user on my application. If I could reduce the userId size that I can also reduce the size of other Ids, I could drastically cut down the space requirements. [Sorry for this question is not directly related to cassandra but I think Cassandra factors here because of its tuneable consistency] Don't generate these ids in cassandra. Use something like snowflake, flickr's ticket servers [2] or zookeeper sequential nodes. -ryan 1. http://github.com/twitter/snowflake 2. http://code.flickr.com/blog/2010/02/08/ticket-servers-distributed-unique-primary-keys-on-the-cheap/
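A sketch of the single-allocator approach suggested above (CounterStore is a hypothetical stand-in for a Cassandra client; the critical assumption is that exactly one instance of this class runs, so the synchronized method really serializes all allocations, with QUORUM reads and writes making the high-water mark durable across node failures):

    // Hypothetical single-writer id allocator: one process owns the counter,
    // synchronizes allocations locally, and persists the high-water mark so
    // a restart cannot hand out a duplicate id.
    public class IdAllocatorSketch {
        interface CounterStore {                 // stand-in for a Cassandra client
            long readHighWaterMark();            // read at CL.QUORUM
            void writeHighWaterMark(long value); // write at CL.QUORUM
        }

        private final CounterStore store;
        private long next;

        IdAllocatorSketch(CounterStore store) {
            this.store = store;
            this.next = store.readHighWaterMark() + 1;
        }

        synchronized long allocate() {
            long id = next++;
            store.writeHighWaterMark(id); // persist before handing the id out
            return id;
        }
    }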
Re: Do supercolumns have a purpose?
My main point was to say that it's think it is better to create tickets for what you want, rather than for something else completely different that would, as a by-product, give you what you want. Then let me say what I want: I want supercolumn families to have any feature that regular column families have. My data model is full of supercolumns. I used them, even though I knew it didn't *have to*, because they were there, which implied to me that I was supposed to use them for some good reason. Now I suspect that they will gradually become less and less functional, as features are added to regular column families and not supported for supercolumn families. On Fri, Feb 4, 2011 at 10:58 AM, Sylvain Lebresne sylv...@datastax.comwrote: On Fri, Feb 4, 2011 at 12:35 AM, Mike Malone m...@simplegeo.com wrote: On Thu, Feb 3, 2011 at 6:44 AM, Sylvain Lebresne sylv...@datastax.comwrote: On Thu, Feb 3, 2011 at 3:00 PM, David Boxenhorn da...@lookin2.comwrote: The advantage would be to enable secondary indexes on supercolumn families. Then I suggest opening a ticket for adding secondary indexes to supercolumn families and voting on it. This will be 1 or 2 order of magnitude less work than getting rid of super column internally, and probably a much better solution anyway. I realize that this is largely subjective, and on such matters code speaks louder than words, but I don't think I agree with you on the issue of which alternative is less work, or even which is a better solution. You are right, I put probably too much emphase in that sentence. My main point was to say that it's think it is better to create tickets for what you want, rather than for something else completely different that would, as a by-product, give you what you want. Then I suspect that *if* the only goal is to get secondary indexes on super columns, then there is a good chance this would be less work than getting rid of super columns. But to be fair, secondary indexes on super columns may not make too much sense without #598, which itself would require quite some work, so clearly I spoke a bit quickly. If the goal is to have a hierarchical model, limiting the depth to two seems arbitrary. Why not go all the way and allow an arbitrarily deep hierarchy? If a more sophisticated hierarchical model is deemed unnecessary, or impractical, allowing a depth of two seems inconsistent and unnecessary. It's pretty trivial to overlay a hierarchical model on top of the map-of-sorted-maps model that Cassandra implements. Ed Anuff has implemented a custom comparator that does the job [1]. Google's Megastore has a similar architecture and goes even further [2]. It seems to me that super columns are a historical artifact from Cassandra's early life as Facebook's inbox storage system. They needed posting lists of messages, sharded by user. So that's what they built. In my dealings with the Cassandra code, super columns end up making a mess all over the place when algorithms need to be special cased and branch based on the column/supercolumn distinction. I won't even mention what it does to the thrift interface. Actually, I agree with you, more than you know. If I were to start coding Cassandra now, I wouldn't include super columns (and I would probably not go for a depth unlimited hierarchical model either). But it's there and I'm not sure getting rid of them fully (meaning, including in thrift) is an option (it would be a big compatibility breakage). 
And (even though I certainly though about this more than once :)) I'm slightly less enthusiastic about keeping them in thrift but encoding them in regular column family internally: it would still be a lot of work but we would still probably end up with nasty tricks to stick to the thrift api. -- Sylvain Mike [1] http://www.anuff.com/2010/07/secondary-indexes-in-cassandra.html [2] http://www.cidrdb.org/cidr2011/Papers/CIDR11_Paper32.pdf
Do supercolumns have a purpose?
Is there any advantage to using supercolumns (columnFamilyName[superColumnName[columnName[val]]]) instead of regular columns with concatenated keys (columnFamilyName[superColumnName@columnName[val]])? When I designed my data model, I used supercolumns wherever I needed two levels of key depth - just because they were there, and I figured that they must be there for a reason. Now I see that in 0.7 secondary indexes don't work on supercolumns or subcolumns (is that right?), which seems to me like a very serious limitation of supercolumn families. It raises the question: Is there anything that supercolumn families are good for? And here's a related question: Why can't Cassandra implement supercolumn families as regular column families, internally, and give you that functionality?
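For concreteness, the concatenated-key alternative mentioned above can be sketched like this (a hand-rolled illustration; the separator and the prefix-slice trick are my assumptions, not a Cassandra API):

    // Emulating CF[superColumn][column] with ordinary columns: concatenate the
    // two key levels into one column name, and fetch a whole "super column"
    // with a prefix slice from "super@" to "super@\uffff".
    public class ConcatenatedKeySketch {
        static final char SEP = '@'; // assumes '@' never appears in sub-column names

        static String columnName(String superColumn, String subColumn) {
            return superColumn + SEP + subColumn;
        }

        static String sliceStart(String superColumn) { return superColumn + SEP; }
        static String sliceEnd(String superColumn)   { return superColumn + SEP + '\uffff'; }

        public static void main(String[] args) {
            System.out.println(columnName("address", "city"));                    // address@city
            System.out.println(sliceStart("address") + " .. " + sliceEnd("address"));
        }
    }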
Re: Do supercolumns have a purpose?
Thanks Sylvain! Can I vote for internally implementing supercolumn families as regular column families? (With a smooth upgrade process that doesn't require shutting down a live cluster.) What if supercolumn families were supported as regular column families + an index (on what used to be supercolumn keys)? Would that solve some problems? On Thu, Feb 3, 2011 at 2:00 PM, Sylvain Lebresne sylv...@datastax.comwrote: Is there any advantage to using supercolumns (columnFamilyName[superColumnName[columnName[val]]]) instead of regular columns with concatenated keys (columnFamilyName[superColumnName@columnName[val]])? When I designed my data model, I used supercolumns wherever I needed two levels of key depth - just because they were there, and I figured that they must be there for a reason. Now I see that in 0.7 secondary indexes don't work on supercolumns or subcolumns (is that right?), which seems to me like a very serious limitation of supercolumn families. It raises the question: Is there anything that supercolumn families are good for? There is a bunch of queries that you cannot do (or less conveniently) if you encode super columns using regular columns with concatenated keys: 1) If you use regular columns with concatenated keys, the count argument count simple columns. With super columns it counts super columns. It means that you can't do give me the 10 first super columns of this row. 2) If you need to get x super columns by name, you'll have to issue x get_slice query (one of each super column). On the client side it sucks. Internally in Cassandra we could do it reasonably well though. 3) You cannot remove entire super columns since there is no support for range deletions. Moreover, the encoding with concatenated keys uses more disk space (and less disk used for the same information means less things to read so it may have a slight impact on read performance too -- it's probably really slight on most usage but nevertheless). And here's a related question: Why can't Cassandra implement supercolumn families as regular column families, internally, and give you that functionality? For the 1) and 2) above, we could deal with those internally fairly easily I think and rather well (which means it wouldn't be much worse performance-wise than with the actual implementaion of super columns, not that it would be better). For 3), range deletes are harder and would require more significant changes (that doesn't mean that Cassandra will never have it). Even without that, there would be the disk space lost. -- Sylvain
Re: Do supercolumns have a purpose?
The advantage would be to enable secondary indexes on supercolumn families. I understand from this thread that indexes are supercolumn families are not going to be: http://www.mail-archive.com/user@cassandra.apache.org/msg09527.html Which, it seems to me, effectively deprecates supercolumn families. (I don't see any of the three problems you brought up as overcoming this problem, except, perhaps, for special cases.) On Thu, Feb 3, 2011 at 3:32 PM, Sylvain Lebresne sylv...@datastax.comwrote: On Thu, Feb 3, 2011 at 1:33 PM, David Boxenhorn da...@lookin2.com wrote: Thanks Sylvain! Can I vote for internally implementing supercolumn families as regular column families? (With a smooth upgrade process that doesn't require shutting down a live cluster.) I forgot to add that I don't know if this make a lot of sense. That would be a fairly major refactor (so error prone), you'd still have to deal with the point I mentioned in my previous mail (for range deletes you would have to change the on-disk format for instance), and all this for no actual benefits, even downsides actually (encoded supercolumn will take more space on-disk (and on-memory)). Super columns are there and work fairly well, so what would be the point ? I'm only just saying that 'in theory', super columns are not the super shiny magical feature that give you stuff you can't hope to have with only regular column family. That doesn't make then at least nice. That being said, you are free to create whatever ticket you want and vote for it. Don't expect too much support tough :) What if supercolumn families were supported as regular column families + an index (on what used to be supercolumn keys)? Would that solve some problems? You'd still have to remember for each CF if it has this index on what used to be supercolumn keys and handle those differently. Really not convince this would make the code cleaner that how it is now. And making the code cleaner is really the only reason I can thing of for wanting to get rid of super columns internally, so ... On Thu, Feb 3, 2011 at 2:00 PM, Sylvain Lebresne sylv...@datastax.comwrote: Is there any advantage to using supercolumns (columnFamilyName[superColumnName[columnName[val]]]) instead of regular columns with concatenated keys (columnFamilyName[superColumnName@columnName[val]])? When I designed my data model, I used supercolumns wherever I needed two levels of key depth - just because they were there, and I figured that they must be there for a reason. Now I see that in 0.7 secondary indexes don't work on supercolumns or subcolumns (is that right?), which seems to me like a very serious limitation of supercolumn families. It raises the question: Is there anything that supercolumn families are good for? There is a bunch of queries that you cannot do (or less conveniently) if you encode super columns using regular columns with concatenated keys: 1) If you use regular columns with concatenated keys, the count argument count simple columns. With super columns it counts super columns. It means that you can't do give me the 10 first super columns of this row. 2) If you need to get x super columns by name, you'll have to issue x get_slice query (one of each super column). On the client side it sucks. Internally in Cassandra we could do it reasonably well though. 3) You cannot remove entire super columns since there is no support for range deletions. 
Moreover, the encoding with concatenated keys uses more disk space (and less disk used for the same information means fewer things to read, so it may have a slight impact on read performance too -- it's probably really slight in most usage, but nevertheless). And here's a related question: Why can't Cassandra implement supercolumn families as regular column families, internally, and give you that functionality? For 1) and 2) above, we could deal with those internally fairly easily I think, and rather well (which means it wouldn't be much worse performance-wise than with the actual implementation of super columns, not that it would be better). For 3), range deletes are harder and would require more significant changes (that doesn't mean that Cassandra will never have it). Even without that, there would be the disk space lost. -- Sylvain
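To make the trade-off above concrete, here is a minimal sketch (plain Python, not Cassandra code) of the concatenated-key encoding being discussed; the separator and helper names are invented for illustration. It shows why points 1) and 2) bite: a count over the row sees individual concatenated columns, and fetching one logical supercolumn becomes a prefix slice rather than a single named get.

SEP = "@"  # hypothetical separator; a real model needs one that cannot appear in the names

row = {}  # stands in for one Cassandra row: column name -> value

def insert(super_name, column_name, value):
    row[super_name + SEP + column_name] = value

def slice_supercolumn(super_name):
    # point 2) above: one logical supercolumn is now a prefix slice
    prefix = super_name + SEP
    return {k: v for k, v in sorted(row.items()) if k.startswith(prefix)}

insert("address", "city", "Tel Aviv")
insert("address", "zip", "61000")
insert("phone", "mobile", "054-0000000")

print(len(row))                      # point 1) above: counts 3 columns, not 2 supercolumns
print(slice_supercolumn("address"))  # {'address@city': 'Tel Aviv', 'address@zip': '61000'}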
Re: Do supercolumns have a purpose?
Well, I am an actual active developer and I have managed to do pretty nice stuffs with Cassandra - without secondary indexes so far. But I'm looking forward to having secondary indexes in my arsenal when new functional requirements come up, and I'm bummed out that my early design decision to use supercolums wherever I could, instead of concatenating keys, has closed off a whole lot of possibilities. I knew when I started that secondary keys were in the future, if I had known that they would be only for regular column families I wouldn't have used supercolumn families in the first place, now I'm pretty much stuck (too late to go back - we're launching in March). On Thu, Feb 3, 2011 at 4:44 PM, Sylvain Lebresne sylv...@datastax.comwrote: On Thu, Feb 3, 2011 at 3:00 PM, David Boxenhorn da...@lookin2.com wrote: The advantage would be to enable secondary indexes on supercolumn families. Then I suggest opening a ticket for adding secondary indexes to supercolumn families and voting on it. This will be 1 or 2 order of magnitude less work than getting rid of super column internally, and probably a much better solution anyway. I understand from this thread that indexes are supercolumn families are not going to be: http://www.mail-archive.com/user@cassandra.apache.org/msg09527.html I should maybe let Jonathan answer this one, but the way I understand it is that adding secondary indexes to super column is not a top priority to actual active developers. Not that it will never ever happen. And voting for tickets in JIRA is one way to help make it raise its priority. In any case, if the goal you're pursuing is adding secondary indexes to super column, then that's the ticket you should open, and if after careful consideration it is decided that getting rid of super column is the best way to reach that goal then so be it (spoiler: it is not). Which, it seems to me, effectively deprecates supercolumn families. (I don't see any of the three problems you brought up as overcoming this problem, except, perhaps, for special cases.) You're untitled to your opinions obviously but I doubt everyone share that feeling (I don't for instance). Before 0.7, there was no secondary indexes at all and still a bunch of people managed to do pretty nice stuffs with Cassandra. In particular denormalized views are sometimes (often?) preferable to secondary indexes for performance reasons. For that super columns are quite handy. -- Sylvain On Thu, Feb 3, 2011 at 3:32 PM, Sylvain Lebresne sylv...@datastax.comwrote: On Thu, Feb 3, 2011 at 1:33 PM, David Boxenhorn da...@lookin2.comwrote: Thanks Sylvain! Can I vote for internally implementing supercolumn families as regular column families? (With a smooth upgrade process that doesn't require shutting down a live cluster.) I forgot to add that I don't know if this make a lot of sense. That would be a fairly major refactor (so error prone), you'd still have to deal with the point I mentioned in my previous mail (for range deletes you would have to change the on-disk format for instance), and all this for no actual benefits, even downsides actually (encoded supercolumn will take more space on-disk (and on-memory)). Super columns are there and work fairly well, so what would be the point ? I'm only just saying that 'in theory', super columns are not the super shiny magical feature that give you stuff you can't hope to have with only regular column family. That doesn't make then at least nice. That being said, you are free to create whatever ticket you want and vote for it. 
Don't expect too much support tough :) What if supercolumn families were supported as regular column families + an index (on what used to be supercolumn keys)? Would that solve some problems? You'd still have to remember for each CF if it has this index on what used to be supercolumn keys and handle those differently. Really not convince this would make the code cleaner that how it is now. And making the code cleaner is really the only reason I can thing of for wanting to get rid of super columns internally, so ... On Thu, Feb 3, 2011 at 2:00 PM, Sylvain Lebresne sylv...@datastax.comwrote: Is there any advantage to using supercolumns (columnFamilyName[superColumnName[columnName[val]]]) instead of regular columns with concatenated keys (columnFamilyName[superColumnName@columnName[val]])? When I designed my data model, I used supercolumns wherever I needed two levels of key depth - just because they were there, and I figured that they must be there for a reason. Now I see that in 0.7 secondary indexes don't work on supercolumns or subcolumns (is that right?), which seems to me like a very serious limitation of supercolumn families. It raises the question: Is there anything that supercolumn families are good for? There is a bunch of queries that you cannot do (or less conveniently) if you encode super
Re: Multi-tenancy, and authentication and authorization
As far as I can tell, if Cassandra supports three levels of configuration (server, keyspace, column family) we can support multi-tenancy. It is trivial to give each tenant their own keyspace (e.g. just use the tenant's id as the keyspace name) and let them go wild. (Any out-of-bounds behavior on the CF level will be stopped at the keyspace and server level before doing any damage.) I don't think Cassandra needs to know about end-users. From Cassandra's point of view the tenant is the user. On Thu, Jan 20, 2011 at 7:00 AM, indika kumara indika.k...@gmail.comwrote: +1 Are there JIRAs for these requirements? I would like to contribute from my capacity. As per my understanding, to support some muti-tenant models, it is needed to qualified keyspaces' names, Cfs' names, etc. with the tenant namespace (or id). The easiest way to do this would be to modify corresponding constructs transparently. I tought of a stage (optional and configurable) prior to authorization. Is there any better solutions? I appreciate the community's suggestions. Moreover, It is needed to send the tenant NS(id) with the user credentials (A users belongs to this tenant (org.)). For that purpose, I thought of using the user credentials in the AuthenticationRequest. s there any better solution? I would like to have a MT support at the Cassandra level which is optional and configurable. Thanks, Indika On Wed, Jan 19, 2011 at 7:40 PM, David Boxenhorn da...@lookin2.comwrote: Yes, the way I see it - and it becomes even more necessary for a multi-tenant configuration - there should be completely separate configurations for applications and for servers. - Application configuration is based on data and usage characteristics of your application. - Server configuration is based on the specific hardware limitations of the server. Obviously, server limitations take priority over application configuration. Assuming that each tenant in a multi-tenant environment gets one keyspace, you would also want to enforce limitations based on keyspace (which correspond to parameters that the tenant payed for). So now we have three levels: 1. Server configuration (top priority) 2. Keyspace configuration (payed-for service - second priority) 3. Column family configuration (configuration provided by tenant - third priority) On Wed, Jan 19, 2011 at 3:15 PM, indika kumara indika.k...@gmail.comwrote: As the actual problem is mostly related to the number of CFs in the system (may be number of the columns), I still believe that supporting exposing the Cassandra ‘as-is’ to a tenant is doable and suitable though need some fixes. That multi-tenancy model allows a tenant to use the programming model of the Cassandra ‘as-is’, enabling the seamless migration of an application that uses the Cassandra into the cloud. Moreover, In order to support different SLA requirements of different tenants, the configurability of keyspaces, cfs, etc., per tenant may be critical. However, there are trade-offs among usability, memory consumption, and performance. I believe that it is important to consider the SLA requirements of different tenants when deciding the strategies for controlling resource consumption. I like to the idea of system-wide parameters for controlling resource usage. I believe that the tenant-specific parameters are equally important. There are resources, and each tenant can claim a portion of them based on SLA. For instance, if there is a threshold on the number of columns per a node, it should be able to decide how many columns a particular tenant can have. 
It allows selecting a suitable Cassandra cluster for a tenant based on his or her SLA. I believe the capability to configure resource controlling parameters per keyspace would be important to support a keyspace per tenant model. Furthermore, In order to maximize the resource sharing among tenants, a threshold (on a resource) per keyspace should not be a hard limit. Rather, it should be oscillated between a hard minimum and a maximum. For example, if a particular tenant needs more resources at a given time, he or she should be possible to borrow from the others up to the maximum. The threshold is only considered when a tenant is assigned to a cluster - the remaining resources of a cluster should be equal or higher than the resource limit of the tenant. It may need to spread a single keyspace across multiple clusters; especially when there are no enough resources in a single cluster. I believe that it would be better to have a flexibility to change seamlessly multi-tenancy implementation models such as the Cassadra ‘as-is’, the keyspace per tenant model, a keyspace for all tenants, and so on. Based on what I have learnt, each model requires adding tenant id (name space) to a keyspace’s name or cf’s name or raw key, or column’s name. Would it be better to have a kind of pluggable handler that can access those resources prior
Re: Multi-tenancy, and authentication and authorization
I have added my comments to this issue: https://issues.apache.org/jira/browse/CASSANDRA-2006 Good luck! On Thu, Jan 20, 2011 at 1:53 PM, indika kumara indika.k...@gmail.comwrote: Thanks David We decided to do it at our client-side as the initial implementation. I will investigate the approaches for supporting the fine grained control of the resources consumed by a sever, tenant, and CF. Thanks, Indika On Thu, Jan 20, 2011 at 3:20 PM, David Boxenhorn da...@lookin2.comwrote: As far as I can tell, if Cassandra supports three levels of configuration (server, keyspace, column family) we can support multi-tenancy. It is trivial to give each tenant their own keyspace (e.g. just use the tenant's id as the keyspace name) and let them go wild. (Any out-of-bounds behavior on the CF level will be stopped at the keyspace and server level before doing any damage.) I don't think Cassandra needs to know about end-users. From Cassandra's point of view the tenant is the user. On Thu, Jan 20, 2011 at 7:00 AM, indika kumara indika.k...@gmail.comwrote: +1 Are there JIRAs for these requirements? I would like to contribute from my capacity. As per my understanding, to support some muti-tenant models, it is needed to qualified keyspaces' names, Cfs' names, etc. with the tenant namespace (or id). The easiest way to do this would be to modify corresponding constructs transparently. I tought of a stage (optional and configurable) prior to authorization. Is there any better solutions? I appreciate the community's suggestions. Moreover, It is needed to send the tenant NS(id) with the user credentials (A users belongs to this tenant (org.)). For that purpose, I thought of using the user credentials in the AuthenticationRequest. s there any better solution? I would like to have a MT support at the Cassandra level which is optional and configurable. Thanks, Indika On Wed, Jan 19, 2011 at 7:40 PM, David Boxenhorn da...@lookin2.comwrote: Yes, the way I see it - and it becomes even more necessary for a multi-tenant configuration - there should be completely separate configurations for applications and for servers. - Application configuration is based on data and usage characteristics of your application. - Server configuration is based on the specific hardware limitations of the server. Obviously, server limitations take priority over application configuration. Assuming that each tenant in a multi-tenant environment gets one keyspace, you would also want to enforce limitations based on keyspace (which correspond to parameters that the tenant payed for). So now we have three levels: 1. Server configuration (top priority) 2. Keyspace configuration (payed-for service - second priority) 3. Column family configuration (configuration provided by tenant - third priority) On Wed, Jan 19, 2011 at 3:15 PM, indika kumara indika.k...@gmail.comwrote: As the actual problem is mostly related to the number of CFs in the system (may be number of the columns), I still believe that supporting exposing the Cassandra ‘as-is’ to a tenant is doable and suitable though need some fixes. That multi-tenancy model allows a tenant to use the programming model of the Cassandra ‘as-is’, enabling the seamless migration of an application that uses the Cassandra into the cloud. Moreover, In order to support different SLA requirements of different tenants, the configurability of keyspaces, cfs, etc., per tenant may be critical. However, there are trade-offs among usability, memory consumption, and performance. 
I believe that it is important to consider the SLA requirements of different tenants when deciding the strategies for controlling resource consumption. I like to the idea of system-wide parameters for controlling resource usage. I believe that the tenant-specific parameters are equally important. There are resources, and each tenant can claim a portion of them based on SLA. For instance, if there is a threshold on the number of columns per a node, it should be able to decide how many columns a particular tenant can have. It allows selecting a suitable Cassandra cluster for a tenant based on his or her SLA. I believe the capability to configure resource controlling parameters per keyspace would be important to support a keyspace per tenant model. Furthermore, In order to maximize the resource sharing among tenants, a threshold (on a resource) per keyspace should not be a hard limit. Rather, it should be oscillated between a hard minimum and a maximum. For example, if a particular tenant needs more resources at a given time, he or she should be possible to borrow from the others up to the maximum. The threshold is only considered when a tenant is assigned to a cluster - the remaining resources of a cluster should be equal or higher than the resource limit of the tenant. It may need to spread a single keyspace across multiple clusters
Re: Use Cassandra to store 2 million records of persons
Cassandra is not a good solution for data mining type problems, since it doesn't have ad-hoc queries. Cassandra is designed to maximize throughput, which is not usually a problem for data mining. On Thu, Jan 20, 2011 at 2:07 PM, Surender Singh suriait2...@gmail.com wrote: Hi All I want to use Apache Cassandra to store information (like first name, last name, gender, address) about 2 million people. Then I need to perform analytics and reporting on that data. Do I need to store the information about 2 million people in MySQL first and then transfer that information into Cassandra? Please help me, as I'm new to Apache Cassandra. If you have any use case like that, please share. Thanks and regards Surender Singh
Re: Multi-tenancy, and authentication and authorization
I'm not sure that you'd still want to retain the ability to individually control how flushing happens on a per-cf basis in order to cater to different workloads that benefit from different flushing behavior. It seems to me like a good system-wide algorithm that works dynamically, and takes into account moment-by-moment usage, can do this better than a human who is guessing and making decisions on a static basis. Having said that, my suggestion doesn't really depend so much on having one memtable or many. Rather, it depends on making flushing behavior dependent on system-wide parameters, which reflect the actual physical resources available per node, rather than per-CF parameters (though per-CF tuning can be taken into account, it should be a suggestion that gets overridden by system-wide needs). On Wed, Jan 19, 2011 at 10:48 AM, Peter Schuller peter.schul...@infidyne.com wrote: Right now there is a one-to-one mapping between memtables and SSTables. Instead of that, would it be possible to have one giant memtable for each Cassandra instance, with partial flushing to SSTs? I think a complication here is that, although I agree things need to be easier to tweak at least for the common case, I'm pretty sure you'd still want to retain the ability to individually control how flushing happens on a per-cf basis in order to cater to different workloads that benefit from different flushing behavior. I suspect the main concern here may be that there is a desire to have better overal control over how flushing happens and when writes start blocking, rather than necessarily implying that there can't be more than one memtable (the ticket Stu posted seems to address one such means of control). -- / Peter Schuller
Re: Multi-tenancy, and authentication and authorization
+1 On Wed, Jan 19, 2011 at 10:35 AM, Stu Hood stuh...@gmail.com wrote: Opened https://issues.apache.org/jira/browse/CASSANDRA-2006 with the solution we had suggested on the MultiTenant wiki page. On Tue, Jan 18, 2011 at 11:56 PM, David Boxenhorn da...@lookin2.comwrote: I think tuning of Cassandra is overly complex, and even with a single tenant you can run into problems with too many CFs. Right now there is a one-to-one mapping between memtables and SSTables. Instead of that, would it be possible to have one giant memtable for each Cassandra instance, with partial flushing to SSTs? It seems to me like a single memtable would make it MUCH easier to tune Cassandra, since the decision whether to (partially) flush the memtable to disk could be made on a node-wide basis, based on the resources you really have, instead of the guess-work that we are forced to do today.
Re: Multi-tenancy, and authentication and authorization
Yes, the way I see it - and it becomes even more necessary for a multi-tenant configuration - there should be completely separate configurations for applications and for servers. - Application configuration is based on data and usage characteristics of your application. - Server configuration is based on the specific hardware limitations of the server. Obviously, server limitations take priority over application configuration. Assuming that each tenant in a multi-tenant environment gets one keyspace, you would also want to enforce limitations based on keyspace (which correspond to parameters that the tenant payed for). So now we have three levels: 1. Server configuration (top priority) 2. Keyspace configuration (payed-for service - second priority) 3. Column family configuration (configuration provided by tenant - third priority) On Wed, Jan 19, 2011 at 3:15 PM, indika kumara indika.k...@gmail.comwrote: As the actual problem is mostly related to the number of CFs in the system (may be number of the columns), I still believe that supporting exposing the Cassandra ‘as-is’ to a tenant is doable and suitable though need some fixes. That multi-tenancy model allows a tenant to use the programming model of the Cassandra ‘as-is’, enabling the seamless migration of an application that uses the Cassandra into the cloud. Moreover, In order to support different SLA requirements of different tenants, the configurability of keyspaces, cfs, etc., per tenant may be critical. However, there are trade-offs among usability, memory consumption, and performance. I believe that it is important to consider the SLA requirements of different tenants when deciding the strategies for controlling resource consumption. I like to the idea of system-wide parameters for controlling resource usage. I believe that the tenant-specific parameters are equally important. There are resources, and each tenant can claim a portion of them based on SLA. For instance, if there is a threshold on the number of columns per a node, it should be able to decide how many columns a particular tenant can have. It allows selecting a suitable Cassandra cluster for a tenant based on his or her SLA. I believe the capability to configure resource controlling parameters per keyspace would be important to support a keyspace per tenant model. Furthermore, In order to maximize the resource sharing among tenants, a threshold (on a resource) per keyspace should not be a hard limit. Rather, it should be oscillated between a hard minimum and a maximum. For example, if a particular tenant needs more resources at a given time, he or she should be possible to borrow from the others up to the maximum. The threshold is only considered when a tenant is assigned to a cluster - the remaining resources of a cluster should be equal or higher than the resource limit of the tenant. It may need to spread a single keyspace across multiple clusters; especially when there are no enough resources in a single cluster. I believe that it would be better to have a flexibility to change seamlessly multi-tenancy implementation models such as the Cassadra ‘as-is’, the keyspace per tenant model, a keyspace for all tenants, and so on. Based on what I have learnt, each model requires adding tenant id (name space) to a keyspace’s name or cf’s name or raw key, or column’s name. Would it be better to have a kind of pluggable handler that can access those resources prior to doing the actual operation so that the required changes can be done? May be prior to authorization. 
Thanks, Indika
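A rough sketch of the three-level priority described above (server limit over keyspace quota over per-CF tuning), in plain Python. None of these names correspond to real Cassandra settings; the point is only the resolution order, where the tenant's own tuning is treated as a suggestion capped by what was paid for and by the hardware.

SERVER_LIMIT = {"memtable_mb": 2048}                       # hardware-derived, top priority
KEYSPACE_QUOTA = {"tenant_a": {"memtable_mb": 512}}        # what the tenant paid for
CF_TUNING = {("tenant_a", "users"): {"memtable_mb": 900}}  # tenant-supplied suggestion

def effective(keyspace, cf, setting):
    requested = CF_TUNING.get((keyspace, cf), {}).get(setting, 0)
    quota = KEYSPACE_QUOTA.get(keyspace, {}).get(setting, SERVER_LIMIT[setting])
    return min(requested, quota, SERVER_LIMIT[setting])

print(effective("tenant_a", "users", "memtable_mb"))  # 512: capped by the keyspace quota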
Getting the version number
Is there any way to use nodetool (or anything else) to get the Cassandra version number of a deployed cluster?
Re: Getting the version number
Yet another reason to move up to 0.7... Thanks. On Wed, Jan 19, 2011 at 5:27 PM, Daniel Lundin d...@eintr.org wrote: in 0.7 nodetool has a `version` command. On Wed, Jan 19, 2011 at 4:09 PM, David Boxenhorn da...@lookin2.com wrote: Is there any way to use nodetool (or anything else) to get the Cassandra version number of a deployed cluster?
Re: Tombstone lifespan after multiple deletions
Thanks. In other words, before I delete something, I should check to see whether it exists as a live row in the first place. On Tue, Jan 18, 2011 at 9:24 AM, Ryan King r...@twitter.com wrote: On Sun, Jan 16, 2011 at 6:53 AM, David Boxenhorn da...@lookin2.com wrote: If I delete a row, and later on delete it again, before GCGraceSeconds has elapsed, does the tombstone live longer? Each delete is a new tombstone, which should answer your question. -ryan In other words, if I have the following scenario: GCGraceSeconds = 10 days On day 1 I delete a row On day 5 I delete the row again Will the tombstone be removed on day 10 or day 15?
Re: Tombstone lifespan after multiple deletions
Thanks, Aaron, but I'm not 100% clear. My situation is this: My use case spins off rows (not columns) that I no longer need and want to delete. It is possible that these rows were never created in the first place, or were already deleted. This is a very large cleanup task that normally deletes a lot of rows, and the last thing that I want to do is create tombstones for rows that didn't exist in the first place, or lengthen the life on disk of tombstones of rows that are already deleted. So the question is: before I delete, do I have to retrieve the row to see if it exists in the first place? On Tue, Jan 18, 2011 at 11:38 AM, Aaron Morton aa...@thelastpickle.comwrote: AFAIK that's not necessary, there is no need to worry about previous deletes. You can delete stuff that does not even exist, neither batch_mutate or remove are going to throw an error. All the columns that were (roughly speaking) present at your first deletion will be available for GC at the end of the first tombstones life. Same for the second. Say you were to write a col between the two deletes with the same name as one present at the start. The first version of the col is avail for GC after tombstone 1, and the second after tombstone 2. Hope that helps Aaron On 18/01/2011, at 9:37 PM, David Boxenhorn da...@lookin2.com wrote: Thanks. In other words, before I delete something, I should check to see whether it exists as a live row in the first place. On Tue, Jan 18, 2011 at 9:24 AM, Ryan King r...@twitter.com r...@twitter.com wrote: On Sun, Jan 16, 2011 at 6:53 AM, David Boxenhorn da...@lookin2.com da...@lookin2.com wrote: If I delete a row, and later on delete it again, before GCGraceSeconds has elapsed, does the tombstone live longer? Each delete is a new tombstone, which should answer your question. -ryan In other words, if I have the following scenario: GCGraceSeconds = 10 days On day 1 I delete a row On day 5 I delete the row again Will the tombstone be removed on day 10 or day 15?
Re: Tombstone lifespan after multiple deletions
Thanks. On Tue, Jan 18, 2011 at 3:55 PM, Sylvain Lebresne sylv...@riptano.comwrote: On Tue, Jan 18, 2011 at 2:41 PM, David Boxenhorn da...@lookin2.com wrote: Thanks, Aaron, but I'm not 100% clear. My situation is this: My use case spins off rows (not columns) that I no longer need and want to delete. It is possible that these rows were never created in the first place, or were already deleted. This is a very large cleanup task that normally deletes a lot of rows, and the last thing that I want to do is create tombstones for rows that didn't exist in the first place, or lengthen the life on disk of tombstones of rows that are already deleted. So the question is: before I delete, do I have to retrieve the row to see if it exists in the first place? Yes, in your situation you do. On Tue, Jan 18, 2011 at 11:38 AM, Aaron Morton aa...@thelastpickle.com wrote: AFAIK that's not necessary, there is no need to worry about previous deletes. You can delete stuff that does not even exist, neither batch_mutate or remove are going to throw an error. All the columns that were (roughly speaking) present at your first deletion will be available for GC at the end of the first tombstones life. Same for the second. Say you were to write a col between the two deletes with the same name as one present at the start. The first version of the col is avail for GC after tombstone 1, and the second after tombstone 2. Hope that helps Aaron On 18/01/2011, at 9:37 PM, David Boxenhorn da...@lookin2.com wrote: Thanks. In other words, before I delete something, I should check to see whether it exists as a live row in the first place. On Tue, Jan 18, 2011 at 9:24 AM, Ryan King r...@twitter.com wrote: On Sun, Jan 16, 2011 at 6:53 AM, David Boxenhorn da...@lookin2.com wrote: If I delete a row, and later on delete it again, before GCGraceSeconds has elapsed, does the tombstone live longer? Each delete is a new tombstone, which should answer your question. -ryan In other words, if I have the following scenario: GCGraceSeconds = 10 days On day 1 I delete a row On day 5 I delete the row again Will the tombstone be removed on day 10 or day 15?
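A small sketch of the timing being discussed, under the behavior stated above that every delete writes its own tombstone with its own timestamp: re-deleting a row does not extend the first tombstone's life, it simply adds a second tombstone with its own GCGraceSeconds clock. The dates just mirror the day-1/day-5 example from the thread.

from datetime import date, timedelta

GC_GRACE = timedelta(days=10)

# one tombstone per delete; each ages independently
deletes = [date(2011, 1, 1), date(2011, 1, 5)]

for d in deletes:
    print(d, "-> eligible for GC on", d + GC_GRACE)
# 2011-01-01 -> eligible for GC on 2011-01-11
# 2011-01-05 -> eligible for GC on 2011-01-15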
Re: Multi-tenancy, and authentication and authorization
I think tuning of Cassandra is overly complex, and even with a single tenant you can run into problems with too many CFs. Right now there is a one-to-one mapping between memtables and SSTables. Instead of that, would it be possible to have one giant memtable for each Cassandra instance, with partial flushing to SSTs? It seems to me like a single memtable would make it MUCH easier to tune Cassandra, since the decision whether to (partially) flush the memtable to disk could be made on a node-wide basis, based on the resources you really have, instead of the guess-work that we are forced to do today.
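As a sketch of what a node-wide flush decision could look like (this is the proposal in the mail, not how Cassandra 0.6/0.7 actually behaves): track every memtable's size against one global budget and, when the budget is exceeded, flush the largest memtables first. The numbers and names are illustrative only.

GLOBAL_BUDGET_MB = 1024

memtable_mb = {"users": 400, "events": 550, "counters": 200}  # per-CF memtables on this node

def pick_flush_victims(sizes, budget):
    victims = []
    remaining = dict(sizes)
    total = sum(remaining.values())
    while total > budget and remaining:
        cf = max(remaining, key=remaining.get)  # biggest memtable goes to disk first
        victims.append(cf)
        total -= remaining.pop(cf)
    return victims

print(pick_flush_victims(memtable_mb, GLOBAL_BUDGET_MB))  # ['events']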
Tombstone lifespan after multiple deletions
If I delete a row, and later on delete it again, before GCGraceSeconds has elapsed, does the tombstone live longer? In other words, if I have the following scenario: GCGraceSeconds = 10 days On day 1 I delete a row On day 5 I delete the row again Will the tombstone be removed on day 10 or day 15?
Re: Usage Pattern : "unique" value of a key.
It is unlikely that both racing threads will have exactly the same microsecond timestamp at the moment of creating a new user - so if the data you read has exactly the same timestamp you used to write the data - it is your data. I think this would have to be combined with CL=QUORUM for both write and read. On Thu, Jan 13, 2011 at 9:57 AM, Oleg Anastasyev olega...@gmail.com wrote: Benoit Perroud benoit at noisette.ch writes: My idea to solve such a use case is to have both threads write the username, but with a column like 'lock-RANDOM VALUE', and then read the row, and find out if the first lock column appearing belongs to the thread. If this is the case, it can continue the process, otherwise it has been preempted by another thread. This looks ok for this task. As an alternative you can avoid creating an extra 'lock-RANDOM VALUE' column and compare the timestamps of the new user data you just wrote. It is unlikely that both racing threads will have exactly the same microsecond timestamp at the moment of creating a new user - so if the data you read has exactly the same timestamp you used to write the data - it is your data. Another possible way is to use some external lock coordinator, e.g. ZooKeeper. For this task it looks like a bit of an overkill, but it can become even more valuable if you have more data concurrency issues to solve and can bear an extra 5-10ms of update latency.
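A runnable sketch of Oleg's timestamp trick, with an in-memory dict standing in for a column family written and read at QUORUM (a real client such as Hector would replace write_at_quorum/read_at_quorum; those names are made up). The thread whose timestamp survives last-write-wins resolution is the one that owns the username.

import time

store = {}  # stand-in for a Users row keyed by username

def microsecond_timestamp():
    return int(time.time() * 1_000_000)

def write_at_quorum(key, value, timestamp):
    current = store.get(key)
    if current is None or timestamp >= current[1]:  # last write wins on timestamp
        store[key] = (value, timestamp)

def read_at_quorum(key):
    return store[key]

def claim_username(username, my_user_id):
    my_ts = microsecond_timestamp()
    write_at_quorum(username, my_user_id, my_ts)
    value, ts = read_at_quorum(username)
    # only the writer whose timestamp survived owns the name
    return ts == my_ts and value == my_user_id

print(claim_username("david", "user-42"))  # True when there is no racing writer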
Re: Reclaim deleted rows space
I think that if SSTs are partitioned within the node using RP, so that each partition is small and can be compacted independently of all other partitions, you can implement an algorithm that will spread out the work of compaction over time so that it never takes a node out of commission, as it does now. I have left a comment here to that effect here: https://issues.apache.org/jira/browse/CASSANDRA-1608?focusedCommentId=12980654page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12980654 On Mon, Jan 10, 2011 at 10:56 PM, Jonathan Ellis jbel...@gmail.com wrote: I'd suggest describing your approach on https://issues.apache.org/jira/browse/CASSANDRA-1608, and if it's attractive, porting it to 0.8. It's too late for us to make deep changes in 0.6 and probably even 0.7 for the sake of stability. On Mon, Jan 10, 2011 at 8:00 AM, shimi shim...@gmail.com wrote: I modified the code to limit the size of the SSTables. I will be glad if someone can take a look at it https://github.com/Shimi/cassandra/tree/cassandra-0.6 Shimi On Fri, Jan 7, 2011 at 2:04 AM, Jonathan Shook jsh...@gmail.com wrote: I believe the following condition within submitMinorIfNeeded(...) determines whether to continue, so it's not a hard loop. // if (sstables.size() = minThreshold) ... On Thu, Jan 6, 2011 at 2:51 AM, shimi shim...@gmail.com wrote: According to the code it make sense. submitMinorIfNeeded() calls doCompaction() which calls submitMinorIfNeeded(). With minimumCompactionThreshold = 1 submitMinorIfNeeded() will always run compaction. Shimi On Thu, Jan 6, 2011 at 10:26 AM, shimi shim...@gmail.com wrote: On Wed, Jan 5, 2011 at 11:31 PM, Jonathan Ellis jbel...@gmail.com wrote: Pretty sure there's logic in there that says don't bother compacting a single sstable. No. You can do it. Based on the log I have a feeling that it triggers an infinite compaction loop. On Wed, Jan 5, 2011 at 2:26 PM, shimi shim...@gmail.com wrote: How does minor compaction is triggered? Is it triggered Only when a new SStable is added? I was wondering if triggering a compaction with minimumCompactionThreshold set to 1 would be useful. If this can happen I assume it will do compaction on files with similar size and remove deleted rows on the rest. Shimi On Tue, Jan 4, 2011 at 9:56 PM, Peter Schuller peter.schul...@infidyne.com wrote: I don't have a problem with disk space. I have a problem with the data size. [snip] Bottom line is that I want to reduce the number of requests that goes to disk. Since there is enough data that is no longer valid I can do it by reclaiming the space. The only way to do it is by running Major compaction. I can wait and let Cassandra do it for me but then the data size will get even bigger and the response time will be worst. I can do it manually but I prefer it to happen in the background with less impact on the system Ok - that makes perfect sense then. Sorry for misunderstanding :) So essentially, for workloads that are teetering on the edge of cache warmness and is subject to significant overwrites or removals, it may be beneficial to perform much more aggressive background compaction even though it might waste lots of CPU, to keep the in-memory working set down. There was talk (I think in the compaction redesign ticket) about potentially improving the use of bloom filters such that obsolete data in sstables could be eliminated from the read set without necessitating actual compaction; that might help address cases like these too. 
I don't think there's a pre-existing silver bullet in a current release; you probably have to live with the need for greater-than-theoretically-optimal memory requirements to keep the working set in memory. -- / Peter Schuller -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com
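A sketch of the loop being discussed above: with a minimum compaction threshold of 1, the "do we have enough SSTables?" check is still satisfied by the single SSTable a compaction just produced, so compaction re-triggers itself. Purely illustrative Python; this is not the real CompactionManager logic.

def submit_minor_if_needed(sstable_count, min_threshold, rounds=0):
    if rounds >= 5:                     # stop the demo
        return "still being re-triggered"
    if sstable_count >= min_threshold:  # the condition quoted above
        merged = 1                      # a compaction outputs one SSTable
        return submit_minor_if_needed(merged, min_threshold, rounds + 1)
    return "idle"

print(submit_minor_if_needed(4, min_threshold=4))  # idle after one pass
print(submit_minor_if_needed(4, min_threshold=1))  # still being re-triggered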
Re: Why my posts are marked as spam?
What's wrong with topposting? This email is non-plain and topposted... On Wed, Jan 12, 2011 at 4:32 PM, zGreenfelder zgreenfel...@gmail.comwrote: On 12 January 2011 05:28, Oleg Tsvinev oleg.tsvi...@gmail.com wrote: Whatever I do, it happens :( On Wed, Jan 12, 2011 at 1:53 AM, Arijit Mukherjee ariji...@gmail.com wrote: I think this happens for RTF. Some of the mails in the post are RTF, and the reply button creates an RTF reply - that's when it happens. Wonder how the mail to which I replied was in RTF... Arijit -- And when the night is cloudy, There is still a light that shines on me, Shine on until tomorrow, let it be. I think it happens for any non-plain text.. be it RTF, HTML, or whatever. at least that's been my limited experience with mailing lists. and for what it's worth (I just had to correct myself, so don't take this as huge criticism), many people are also opposed to topposting .. or adding a reply to the top of an email. FWIW. -- Even the Magic 8 ball has an opinion on email clients: Outlook not so good.
Re: Bootstrapping taking long
My nodes all have themselves in their list of seeds - always did - and everything works. (You may ask why I did this. I don't know, I must have copied it from an example somewhere.) On Wed, Jan 5, 2011 at 9:42 AM, Ran Tavory ran...@gmail.com wrote: I was able to make the node join the ring but I'm confused. What I did is, first when adding the node, this node was not in the seeds list of itself. AFAIK this is how it's supposed to be. So it was able to transfer all data to itself from other nodes but then it stayed in the bootstrapping state. So what I did (and I don't know why it works), is add this node to the seeds list in its own storage-conf.xml file. Then restart the server and then I finally see it in the ring... If I had added the node to the seeds list of itself when first joining it, it would not join the ring but if I do it in two phases it did work. So it's either my misunderstanding or a bug... On Wed, Jan 5, 2011 at 7:14 AM, Ran Tavory ran...@gmail.com wrote: The new node does not see itself as part of the ring, it sees all others but itself, so from that perspective the view is consistent. The only problem is that the node never finishes to bootstrap. It stays in this state for hours (It's been 20 hours now...) $ bin/nodetool -p 9004 -h localhost streams Mode: Bootstrapping Not sending any streams. Not receiving any streams. On Wed, Jan 5, 2011 at 1:20 AM, Nate McCall n...@riptano.com wrote: Does the new node have itself in the list of seeds per chance? This could cause some issues if so. On Tue, Jan 4, 2011 at 4:10 PM, Ran Tavory ran...@gmail.com wrote: I'm still at lost. I haven't been able to resolve this. I tried adding another node at a different location on the ring but this node too remains stuck in the bootstrapping state for many hours without any of the other nodes being busy with anti compaction or anything else. I don't know what's keeping it from finishing the bootstrap,no CPU, no io, files were already streamed so what is it waiting for? I read the release notes of 0.6.7 and 0.6.8 and there didn't seem to be anything addressing a similar issue so I figured there was no point in upgrading. But let me know if you think there is. Or any other advice... On Tuesday, January 4, 2011, Ran Tavory ran...@gmail.com wrote: Thanks Jake, but unfortunately the streams directory is empty so I don't think that any of the nodes is anti-compacting data right now or had been in the past 5 hours. It seems that all the data was already transferred to the joining host but the joining node, after having received the data would still remain in bootstrapping mode and not join the cluster. I'm not sure that *all* data was transferred (perhaps other nodes need to transfer more data) but nothing is actually happening so I assume all has been moved. Perhaps it's a configuration error from my part. Should I use I use AutoBootstrap=true ? Anything else I should look out for in the configuration file or something else? On Tue, Jan 4, 2011 at 4:08 PM, Jake Luciani jak...@gmail.com wrote: In 0.6, locate the node doing anti-compaction and look in the streams subdirectory in the keyspace data dir to monitor the anti-compaction progress (it puts new SSTables for bootstrapping node in there) On Tue, Jan 4, 2011 at 8:01 AM, Ran Tavory ran...@gmail.com wrote: Running nodetool decommission didn't help. Actually the node refused to decommission itself (b/c it wasn't part of the ring). So I simply stopped the process, deleted all the data directories and started it again. 
It worked in the sense that the node bootstrapped again, but as before, after it had finished moving the data nothing happened for a long time (I'm still waiting, but nothing seems to be happening). Any hints on how to analyze a stuck bootstrapping node?? Thanks On Tue, Jan 4, 2011 at 1:51 PM, Ran Tavory ran...@gmail.com wrote: Thanks Shimi, so indeed anticompaction was run on one of the other nodes from the same DC, but to my understanding it has already ended. A few hours ago... I see plenty of log messages such as [1] which ended a couple of hours ago, and I've seen the new node streaming and accepting the data from the node which performed the anticompaction, and so far it was normal, so it seemed that the data is in its right place. But now the new node seems sort of stuck. None of the other nodes is anticompacting right now or has been anticompacting since then. The new node's CPU is close to zero, its iostats are almost zero, so I can't find another bottleneck that would keep it hanging. On IRC someone suggested I maybe retry joining this node, e.g. decommission and rejoin it again. I'll try it now... [1] INFO [COMPACTION-POOL:1] 2011-01-04 04:04:09,721 CompactionManager.java (line 338) AntiCompacting
Re: The CLI sometimes gets 100 results even though there are more, and sometimes gets more than 100
I know that there's a limit, and I just assumed that the CLI set it to 100, until I saw more than 100 results. On Wed, Jan 5, 2011 at 6:56 PM, Peter Schuller peter.schul...@infidyne.comwrote: The CLI sometimes gets only 100 results (even though there are more) - and sometimes gets all the results, even when there are more than 100! What is going on here? Is there some logic that says if there are too many results return 100, even though too many can be more than 100? API calls have a limit since streaming is not supported and you could potentially have almost arbitrary large result sets. I believe cassandra-cli will allow you to set the limit if you look at the 'help' output and look for the word 'limit'. The way to iterate over large amounts of data is to do paging, with multiple queries. -- / Peter Schuller
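Peter's paging suggestion, sketched with a plain dict standing in for the range query (the get_range helper here is invented, not a real client call): each page starts from the last key of the previous page, and the overlapping first row is dropped. With RandomPartitioner the server orders rows by token rather than by key, but the start-from-last-key loop is the same.

data = {f"key{i:03d}": f"value{i}" for i in range(250)}

def get_range(start_key, count):
    # stand-in for the client's range-slice call (get_range_slices in Thrift terms)
    keys = sorted(k for k in data if k >= start_key)
    return [(k, data[k]) for k in keys[:count]]

def iterate_all(page_size=100):
    start = ""
    while True:
        page = get_range(start, page_size)
        if not page:
            break
        for key, value in page:
            if key == start:        # drop the overlapping first row
                continue
            yield key, value
        if len(page) < page_size:
            break
        start = page[-1][0]         # next page starts at the last key seen

print(sum(1 for _ in iterate_all()))  # 250, well past the default limit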
Re: iterate over all the rows with RP
Shimi, I am using Hector to do exactly what you want to do, with no problems. (In fact, the question didn't even occur to me...) On Sun, Dec 12, 2010 at 9:03 PM, Ran Tavory ran...@gmail.com wrote: This should be the case, yes, semantics isn't affected by the connection and state isn't kept. What might happen if you read/write with low consistency levels then when you hit a different host on the ring it might have an inconsistent state in case of partition. On Sunday, December 12, 2010, shimi shim...@gmail.com wrote: So if I will use a different connection (thrift via Hector), will I get the same results? It's make sense when you use OPP and I assume it is the same with RP. I just wanted to make sure this is the case and there is no state which is kept. Shimi On Sun, Dec 12, 2010 at 8:14 PM, Peter Schuller peter.schul...@infidyne.com wrote: Is the same connection is required when iterating over all the rows with Random Paritioner or is it possible to use a different connection for each iteration? In general, the choice of RPC connection (I assume you mean the underlying thrift connection) does not affect the semantics of the RPC calls. -- / Peter Schuller -- /Ran
Re: N to N relationships
You want to store every value twice? That would be a pain to maintain, and possibly lead to inconsistent data. On Fri, Dec 10, 2010 at 3:50 AM, Nick Bailey n...@riptano.com wrote: I would also recommend two column families. Storing the key as NxN would require you to hit multiple machines to query for an entire row or column with RandomPartitioner. Even with OPP you would need to pick row or columns to order by and the other would require hitting multiple machines. Two column families avoids this and avoids any problems with choosing OPP. On Thu, Dec 9, 2010 at 2:26 PM, Aaron Morton aa...@thelastpickle.comwrote: Am assuming you have one matrix and you know the dimensions. Also as you say the most important queries are to get an entire column or an entire row. I would consider using a standard CF for the Columns and one for the Rows. The key for each would be the col / row number, each cassandra column name would be the id of the other dimension and the value whatever you want. - when storing the data update both the Column and Row CF - reading a whole row/col would be simply reading from the appropriate CF. - reading an intersection is a get_slice to either col or row CF using the column_names field to identify the other dimension. You would not need secondary indexes to serve these queries. Hope that helps. Aaron On 10 Dec, 2010,at 07:02 AM, Sébastien Druon sdr...@spotuse.com wrote: I mean if I have secondary indexes. Apparently they are calculated in the background... On 9 December 2010 18:33, David Boxenhorn da...@lookin2.com wrote: What do you mean by indexing? On Thu, Dec 9, 2010 at 7:30 PM, Sébastien Druon sdr...@spotuse.comwrote: Thanks a lot for the answer What about the indexing when adding a new element? Is it incremental? Thanks again On 9 December 2010 14:38, David Boxenhorn da...@lookin2.com wrote: How about a regular CF where keys are n...@n ? Then, getting a matrix row would be the same cost as getting a matrix column (N gets), and it would be very easy to add element N+1. On Thu, Dec 9, 2010 at 1:48 PM, Sébastien Druon sdr...@spotuse.comwrote: Hello, For a specific case, we are thinking about representing a N to N relationship with a NxN Matrix in Cassandra. The relations will be only between a subset of elements, so the Matrix will mostly contain empty elements. We have a set of questions concerning this: - what is the best way to represent this matrix? what would have the best performance in reading? in writing? . a super column family with n column families, with n columns each . a column family with n columns and n lines In the second case, we would need to extract 2 kinds of information: - all the relations for a line: this should be no specific problem; - all the relations for a column: in that case we would need an index for the columns, right? and then get all the lines where the value of the column in question is not null... is it the correct way to do? When using indexes, say we want to add another element N+1. What impact in terms of time would it have on the indexation job? Thanks a lot for the answers, Best regards, Sébastien Druon
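For reference, a sketch of the two-column-family layout Nick and Aaron describe, with plain dicts standing in for the Rows and Columns CFs: every cell is written twice, so a whole row or a whole column is a single-key read, and an intersection is a name lookup in either one. The double write is exactly the maintenance cost raised above, so both copies must be updated together.

rows_cf = {}  # key = row id,    columns = {column id: value}
cols_cf = {}  # key = column id, columns = {row id: value}

def put(row_id, col_id, value):
    # one logical cell, written to both CFs
    rows_cf.setdefault(row_id, {})[col_id] = value
    cols_cf.setdefault(col_id, {})[row_id] = value

def whole_row(row_id):
    return rows_cf.get(row_id, {})

def whole_column(col_id):
    return cols_cf.get(col_id, {})

def intersection(row_id, col_id):
    return rows_cf.get(row_id, {}).get(col_id)

put("row1", "col7", "x")
put("row2", "col7", "y")
print(whole_column("col7"))          # {'row1': 'x', 'row2': 'y'}
print(intersection("row1", "col7"))  # 'x'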
Re: N to N relationships
How about a regular CF where keys are n...@n ? Then, getting a matrix row would be the same cost as getting a matrix column (N gets), and it would be very easy to add element N+1. On Thu, Dec 9, 2010 at 1:48 PM, Sébastien Druon sdr...@spotuse.com wrote: Hello, For a specific case, we are thinking about representing a N to N relationship with a NxN Matrix in Cassandra. The relations will be only between a subset of elements, so the Matrix will mostly contain empty elements. We have a set of questions concerning this: - what is the best way to represent this matrix? what would have the best performance in reading? in writing? . a super column family with n column families, with n columns each . a column family with n columns and n lines In the second case, we would need to extract 2 kinds of information: - all the relations for a line: this should be no specific problem; - all the relations for a column: in that case we would need an index for the columns, right? and then get all the lines where the value of the column in question is not null... is it the correct way to do? When using indexes, say we want to add another element N+1. What impact in terms of time would it have on the indexation job? Thanks a lot for the answers, Best regards, Sébastien Druon
Secondary indexes change everything?
It seems to me that secondary indexes (new in 0.7) change everything when it comes to data modeling. - OOP becomes obsolete - primary indexes become obsolete if you ever want to do a range query (which you probably will...), better to assign a random row id Taken together, it's likely that very little will remain of your old database schema... Am I right?
Re: Secondary indexes change everything?
- OPP becomes obsolete (OOP is not obsolete!) - primary indexes become obsolete if you ever want to do a range query (which you probably will...), better to assign a random row id Taken together, it's likely that very little will remain of your old database schema... Am I right?
Re: Quorum: killing 1 out of 3 server kills the cluster (?)
In other words, if you want to use QUORUM, you need to set RF=3. (I know because I had exactly the same problem.) On Thu, Dec 9, 2010 at 6:05 PM, Sylvain Lebresne sylv...@yakaz.com wrote: It's 2 out of the number of replicas, not the number of nodes. At RF=2, you have 2 replicas. And since quorum is also 2 with that replication factor, you cannot lose a node, otherwise some queries will end up as UnavailableException. Again, this is not related to the total number of nodes. Even with 200 nodes, if you use RF=2, you will have some queries that fail (although much less than what you are probably seeing). On Thu, Dec 9, 2010 at 5:00 PM, Timo Nentwig timo.nent...@toptarif.de wrote: On Dec 9, 2010, at 16:50, Daniel Lundin wrote: Quorum is really only useful when RF > 2, since for a quorum to succeed RF/2+1 replicas must be available. 2/2+1==2 and I killed 1 of 3, so... don't get it. This means for RF = 2, consistency levels QUORUM and ALL yield the same result. /d On Thu, Dec 9, 2010 at 4:40 PM, Timo Nentwig timo.nent...@toptarif.de wrote: Hi! I've 3 servers running (0.7rc1) with a replication_factor of 2 and use quorum for writes. But when I shut down one of them UnavailableExceptions are thrown. Why is that? Isn't that the sense of quorum and a fault-tolerant DB, that it continues with the remaining 2 nodes and redistributes the data to the broken one as soon as it's up again? What may I be doing wrong? thx tcn
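The arithmetic behind this, as a tiny sketch: quorum is computed from the replication factor, not from the number of nodes in the cluster, so RF=2 leaves no room for a downed replica while RF=3 leaves one.

def quorum(rf):
    return rf // 2 + 1

for rf in (2, 3):
    q = quorum(rf)
    print(f"RF={rf}: QUORUM needs {q} replicas, tolerates {rf - q} replica(s) down")
# RF=2: QUORUM needs 2 replicas, tolerates 0 replica(s) down
# RF=3: QUORUM needs 2 replicas, tolerates 1 replica(s) down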
Re: Quorum: killing 1 out of 3 server kills the cluster (?)
If that is what you want, use CL=ONE On Thu, Dec 9, 2010 at 6:43 PM, Timo Nentwig timo.nent...@toptarif.dewrote: On Dec 9, 2010, at 17:39, David Boxenhorn wrote: In other words, if you want to use QUORUM, you need to set RF=3. (I know because I had exactly the same problem.) I naively assume that if I kill either node that holds N1 (i.e. node 1 or 3), N1 will still remain on another node. Only if both fail, I actually lose data. But apparently this is not how it works... On Thu, Dec 9, 2010 at 6:05 PM, Sylvain Lebresne sylv...@yakaz.com wrote: I'ts 2 out of the number of replicas, not the number of nodes. At RF=2, you have 2 replicas. And since quorum is also 2 with that replication factor, you cannot lose a node, otherwise some query will end up as UnavailableException. Again, this is not related to the total number of nodes. Even with 200 nodes, if you use RF=2, you will have some query that fail (altough much less that what you are probably seeing). On Thu, Dec 9, 2010 at 5:00 PM, Timo Nentwig timo.nent...@toptarif.de wrote: On Dec 9, 2010, at 16:50, Daniel Lundin wrote: Quorum is really only useful when RF 2, since the for a quorum to succeed RF/2+1 replicas must be available. 2/2+1==2 and I killed 1 of 3, so... don't get it. This means for RF = 2, consistency levels QUORUM and ALL yield the same result. /d On Thu, Dec 9, 2010 at 4:40 PM, Timo Nentwig timo.nent...@toptarif.de wrote: Hi! I've 3 servers running (0.7rc1) with a replication_factor of 2 and use quorum for writes. But when I shut down one of them UnavailableExceptions are thrown. Why is that? Isn't that the sense of quorum and a fault-tolerant DB that it continues with the remaining 2 nodes and redistributes the data to the broken one as soons as its up again? What may I be doing wrong? thx tcn
Re: N to N relationships
What do you mean by indexing? On Thu, Dec 9, 2010 at 7:30 PM, Sébastien Druon sdr...@spotuse.com wrote: Thanks a lot for the answer What about the indexing when adding a new element? Is it incremental? Thanks again On 9 December 2010 14:38, David Boxenhorn da...@lookin2.com wrote: How about a regular CF where keys are n...@n ? Then, getting a matrix row would be the same cost as getting a matrix column (N gets), and it would be very easy to add element N+1. On Thu, Dec 9, 2010 at 1:48 PM, Sébastien Druon sdr...@spotuse.comwrote: Hello, For a specific case, we are thinking about representing a N to N relationship with a NxN Matrix in Cassandra. The relations will be only between a subset of elements, so the Matrix will mostly contain empty elements. We have a set of questions concerning this: - what is the best way to represent this matrix? what would have the best performance in reading? in writing? . a super column family with n column families, with n columns each . a column family with n columns and n lines In the second case, we would need to extract 2 kinds of information: - all the relations for a line: this should be no specific problem; - all the relations for a column: in that case we would need an index for the columns, right? and then get all the lines where the value of the column in question is not null... is it the correct way to do? When using indexes, say we want to add another element N+1. What impact in terms of time would it have on the indexation job? Thanks a lot for the answers, Best regards, Sébastien Druon
Using mySQL to emulate Cassandra
As our launch date approaches, I am getting increasingly nervous about Cassandra tuning. It is a mysterious black art that I haven't mastered even at the low usages that we have now. I know of a few more things I can do to improve things, but how will I know if it is enough? All this is particularly ironic since - as we are just starting out - we don't have scalability problems yet, though we hope to! Luckily, I have completely wrapped Cassandra in an entity mapper, so that I can easily trade in something else, perhaps temporarily, until we really need Cassandra's scalability. So, I'm thinking of emulating Cassandra with mySQL. I would use mySQL either as a simple key-value store, without joins, or map Cassandra supercolumns to mySQL columns, probably of type CLOB. Does anyone want to talk me out of this?
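If it helps to picture the fallback, here is a minimal sketch of a key-value table behind the entity mapper; sqlite3 stands in for MySQL so the snippet is self-contained, and the table, column, and class names are all hypothetical. Each row key maps to one serialized blob of columns (or of a whole supercolumn).

import json
import sqlite3

class SqlKeyValueStore:
    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS kv (row_key TEXT PRIMARY KEY, columns TEXT)"
        )

    def put(self, row_key, columns):
        self.db.execute(
            "REPLACE INTO kv (row_key, columns) VALUES (?, ?)",
            (row_key, json.dumps(columns)),
        )

    def get(self, row_key):
        cur = self.db.execute("SELECT columns FROM kv WHERE row_key = ?", (row_key,))
        row = cur.fetchone()
        return json.loads(row[0]) if row else None

store = SqlKeyValueStore()
store.put("user:1", {"name": "David", "city": "Tel Aviv"})
print(store.get("user:1"))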
Taking down a node in a 3-node cluster, RF=2
For the vast majority of my data usage eventual consistency is fine (i.e. CL=ONE) but I have a small amount of critical data for which I read and write using CL=QUORUM. If I have a cluster with 3 nodes and RF=2, and CL=QUORUM does that mean that a value can be read from or written to any 2 nodes, or does it have to be the particular 2 nodes that store the data? If it is the particular 2 nodes that store the data, that means that I can't even take down one node, since it will be the mandatory 2nd node for 1/3 of my data...
Re: Taking down a node in a 3-node cluster, RF=2
Thank you, Jake. It does... except that in another context you told me: Hints only happen when a node is unavailable and you are writing with CL.ANY. If you never write with CL.ANY, then you can turn off hinted handoff. How do I reconcile this? On Sun, Nov 28, 2010 at 7:11 PM, Jake Luciani jak...@gmail.com wrote: If you read/write data with quorum then you can safely take a node down in this scenario. Subsequent writes will use hinted handoff to be passed to the node when it comes back up. More info is here: http://wiki.apache.org/cassandra/HintedHandoff Does that answer your question? -Jake On Sun, Nov 28, 2010 at 9:42 AM, Ran Tavory ran...@gmail.com wrote: To me it makes sense that if hinted handoff is off then cassandra cannot satisfy 2 out of every 3 writes when one of the nodes is down, since this node is the designated node of 2/3 writes. But I don't remember reading this anywhere. Does hinted handoff affect David's situation? (David, did you disable HH in your storage-config? <HintedHandoffEnabled>false</HintedHandoffEnabled>) On Sun, Nov 28, 2010 at 4:32 PM, David Boxenhorn da...@lookin2.com wrote: For the vast majority of my data usage eventual consistency is fine (i.e. CL=ONE) but I have a small amount of critical data for which I read and write using CL=QUORUM. If I have a cluster with 3 nodes and RF=2, and CL=QUORUM does that mean that a value can be read from or written to any 2 nodes, or does it have to be the particular 2 nodes that store the data? If it is the particular 2 nodes that store the data, that means that I can't even take down one node, since it will be the mandatory 2nd node for 1/3 of my data... -- /Ran
Re: Taking down a node in a 3-node cluster, RF=2
OK. To sum up: RF=2 and QUORUM are incompatible (if you want to be able to take a node down). Right? On Sun, Nov 28, 2010 at 7:59 PM, Jake Luciani jak...@gmail.com wrote: I was wrong on this scenario and I'll explain where I was incorrect. Hints are stored for a downed node but they don't count towards meeting a consistency level. Let's take 2 scenarios: RF=6, Nodes=10 If you READ/WRITE with CL.QUORUM you will need 4 alive nodes if one is down it will still have 4 active replicas to write to, one of these will store a hint and update the downed node when it comes back. RF=2, Nodes=3 If you READ/WRITE with CL.QUORUM you need 2 live nodes. If one of these 2 are down you can't meet the QUORUM level so the write will fail. In your scenario your best bet is to update to RF=3, then any two nodes will accept QUORUM Sorry for the confusion, -Jake On Sun, Nov 28, 2010 at 12:26 PM, David Boxenhorn da...@lookin2.comwrote: Thank you, Jake. It does... except that in another context you told me: Hints only happen when a node is unavailable and you are writing with CL.ANY If you never write with CL.ANY then you can turn off hinted handoff. How do I reconcile this? On Sun, Nov 28, 2010 at 7:11 PM, Jake Luciani jak...@gmail.com wrote: If you read/write data with quorum then you can safely take a node down in this scenario. Subsequent writes will use hinted handoff to be passed to the node when it comes back up. More info is here: http://wiki.apache.org/cassandra/HintedHandoff Does that answer your question? -Jake On Sun, Nov 28, 2010 at 9:42 AM, Ran Tavory ran...@gmail.com wrote: to me it makes sense that if hinted handoff is off then cassandra cannot satisfy 2 out of every 3rd writes writes when one of the nodes is down since this node is the designated node of 2/3 writes. But I don't remember reading this somewhere. Does hinted handoff affect David's situation? (David, did you disable HH in your storage-config? HintedHandoffEnabledfalse/HintedHandoffEnabled) On Sun, Nov 28, 2010 at 4:32 PM, David Boxenhorn da...@lookin2.comwrote: For the vast majority of my data usage eventual consistency is fine (i.e. CL=ONE) but I have a small amount of critical data for which I read and write using CL=QUORUM. If I have a cluster with 3 nodes and RF=2, and CL=QUORUM does that mean that a value can be read from or written to any 2 nodes, or does it have to be the particular 2 nodes that store the data? If it is the particular 2 nodes that store the data, that means that I can't even take down one node, since it will be the mandatory 2nd node for 1/3 of my data... -- /Ran
Re: Facebook messaging and choice of HBase over Cassandra - what can we learn?
It's true that Cassandra has tunable consistency, but if eventual consistency is not sufficient for most of your use cases, Cassandra becomes much less attractive. Am I wrong? On Sun, Nov 21, 2010 at 7:56 PM, Eric Evans eev...@rackspace.com wrote: On Sun, 2010-11-21 at 11:32 -0500, Simon Reavely wrote: As a Cassandra user I think the key sentence for this community is: "We found Cassandra's eventual consistency model to be a difficult pattern to reconcile for our new Messages infrastructure." In my experience, "we needed strong consistency", in conversations like these, amounts to hand waving. It's the fastest way to shut down that part of the discussion without having said anything at all. I think it would be useful to find out more about this statement from Kannan and the Facebook team. Does anyone have any contacts in the Facebook team? Good luck. Facebook is notoriously tight-lipped about such things. My goal here is to understand usage patterns and whether or not the Cassandra community can learn from this decision; maybe even understand whether the Cassandra roadmap should be influenced by this decision to address a target user base. Of course we might also conclude that it's just not a Cassandra use case! Understanding is a laudable goal, just try to avoid drawing conclusions (and call out others who are). <rant> This is usually the point where a frenzy kicks in and folks assume that the Smart Guys at Facebook know something they don't, something that would invalidate their decision if they'd only known. I seriously doubt they've uncovered some Truth that would fundamentally alter the reasoning behind *my* decision to use Cassandra, and so I plan to continue as I always have: following relevant research and development, collecting experience (my own and others'), and applying it to the problems I face. </rant> -- Eric Evans eev...@rackspace.com
Re: Facebook messaging and choice of HBase over Cassandra - what can we learn?
Yes, but the value is supposed to be 11, since the write failed. On Mon, Nov 22, 2010 at 2:27 PM, André Fiedler fiedler.an...@googlemail.com wrote: Doesn't Cassandra sync all nodes once the network is up again? I think this was one of the reasons for storing a timestamp with every key/value pair? So I think the response will only temporarily be 11. Once all nodes have synced it should be 12? Or isn't that so? Greetings, André 2010/11/22 Samuel Carrière samuel.carri...@gmail.com Cassandra can work in a consistent way, see some of this discussion and the Consistency section here http://wiki.apache.org/cassandra/ArchitectureOverview If you always read and write with CL.Quorum (or the other way discussed) you will have consistency, even if some of the replicas are temporarily inconsistent, or offline, or whatever. Your reads will be consistent, i.e. every client will get the same value or the read will not work. If you want to work at a lower or higher consistency you can. Eventually all replicas of a value will become consistent. There are a number of reasons why Cassandra may not be a good fit, and I would guess something else would be a problem before the consistency model. Hope that helps. Aaron Hello, I like Cassandra a lot and I'm sure it can be used in many use cases, but I'm not sure we can say that we have strong consistency, even if we read and write with CL.Quorum. Firstly, we can only expect consistency at the column level. Reading and writing with CL.Quorum gives you a consistent value for each individual column most of the time, but it does not mean it gives you a consistent view of your data. (Because Cassandra gives you no isolation and no transactions, your application has to deal with data inconsistencies.) Secondly, I may be wrong, but I'm not sure consistency at the column level is guaranteed. Here is an example, with a replication factor of 3. Imagine that the current value of col1 is 11. Your application tries to write col1 = 12 with CL.Quorum. Imagine the write arrives at node 1, but the new value is not transmitted to nodes 2 and 3 because of network failures. So the write fails (this is the expected behaviour), but node 1 still has the new value (there is no rollback). Then, imagine that the network is back to normal, and that another client asks for the value of col1, with CL.Quorum. Here, the value of the response is not guaranteed. If the client asks node 2 and node 3, the response will be 11, but if it asks node 1 and node 2 or 3, the response will be 12. Am I missing something? Samuel
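To make the scenario above concrete, here is a small, self-contained simulation. It is purely illustrative: the Replica class and the quorumRead helper are invented for the example and this is not how Cassandra is implemented internally; it only shows the timestamp resolution that a quorum read performs over whichever replicas happen to respond.

// Illustrative simulation of the scenario above: a QUORUM write that fails
// part-way can still leave one replica holding the newer value, and a later
// QUORUM read returns whichever value has the highest timestamp among the
// replicas that answer. All classes here are invented for the example.
import java.util.Arrays;
import java.util.List;

public class QuorumReadExample {

    // One replica's copy of a single column: a value plus its write timestamp.
    static class Replica {
        long value;
        long timestamp;
        Replica(long value, long timestamp) { this.value = value; this.timestamp = timestamp; }
    }

    // A read over the responding replicas: the highest timestamp wins.
    static long quorumRead(List<Replica> responding) {
        Replica newest = responding.get(0);
        for (Replica r : responding) {
            if (r.timestamp > newest.timestamp) newest = r;
        }
        return newest.value;
    }

    public static void main(String[] args) {
        // RF=3: all three replicas start with col1 = 11, written at t=1.
        Replica node1 = new Replica(11, 1);
        Replica node2 = new Replica(11, 1);
        Replica node3 = new Replica(11, 1);

        // A QUORUM write of col1 = 12 at t=2 reaches only node 1 and then
        // fails (fewer than 2 acks), but node 1 keeps the value: no rollback.
        node1.value = 12; node1.timestamp = 2;

        // Later QUORUM reads: the answer depends on which two replicas reply.
        System.out.println(quorumRead(Arrays.asList(node2, node3))); // 11 (stale pair)
        System.out.println(quorumRead(Arrays.asList(node1, node2))); // 12 (newer timestamp wins)
    }
}

Read repair (and anti-entropy repair) would eventually propagate the value with the higher timestamp, 12, to all replicas, which is the crux of the disagreement above: the failed quorum write is not rolled back, so its value can still end up winning.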
Consulting for Rollout + Cassandra
We are planning a rollout of our online product ~September 1. Cassandra is a major part of our online system. We need some Cassandra consulting + general online consulting for determining our server configuration so it will support Cassandra under all possible scenarios. Does anybody have any ideas for us? Thanks!
OPP + Hash on client side
Is there any strategy for using OPP with a hash algorithm on the client side to get both uniform distribution of data in the cluster *and* the ability to do range queries? I'm thinking of something like this: cassKey = (key % 97) + "@" + key; cassRange = 0 + "@" + range; 1 + "@" + range; ... 96 + "@" + range; Would something like that work?
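As a rough illustration of the idea in the question: the 97-bucket count and the "@" separator are taken from the pseudocode above, while the use of a hash of the key (rather than "key % 97", which only works for numeric keys) and all class and method names are assumptions made up for the sketch.

// Illustrative sketch of the bucketed-key idea above. Not a client API;
// just string construction.
public class BucketedKey {

    static final int NUM_BUCKETS = 97;   // assumption, taken from the pseudocode

    // Prefix every row key with its bucket so OPP spreads load across buckets
    // while keys inside a bucket stay in sorted order. floorMod keeps the
    // bucket non-negative even for negative hash codes.
    static String cassKey(String key) {
        int bucket = Math.floorMod(key.hashCode(), NUM_BUCKETS);
        return bucket + "@" + key;
    }

    public static void main(String[] args) {
        System.out.println(cassKey("user:12345"));
        System.out.println(cassKey("user:12346"));
    }
}

Because each range query would be issued per bucket with the same prefix, the lexical ordering of the bucket prefixes relative to each other (e.g. "10@" sorting before "2@") does not matter.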
Re: OPP + Hash on client side
Aaron, thank you for the link. What is discussed there is not exactly what I am thinking of. They propose distributing the keys with MD5(ROWKEY).ROWKEY - which distributes the values in a way that cannot easily be reversed. What I am proposing is to distribute the keys evenly among N buckets, where N is much larger than your number of nodes, and then construct my range queries as the union of N range queries that I actually perform on Cassandra. "You can do range queries with the Random Partitioner in 0.6.*" - I went through this before; it's not true. What you can do is loop over your entire set of keys in random order. There is no way to get an actual range other than the whole range. On Wed, Jul 7, 2010 at 1:15 PM, Aaron Morton aa...@thelastpickle.com wrote: That pattern is discussed here http://ria101.wordpress.com/2010/02/22/cassandra-randompartitioner-vs-orderpreservingpartitioner/ It's also used in http://github.com/tjake/Lucandra You can do range queries with the Random Partitioner in 0.6.*, the order of the return is undefined and it's a bit slower. I think it's normally used when you want ordered range queries in some CFs and random distribution in others. Aaron On 07 Jul, 2010, at 09:47 PM, David Boxenhorn da...@lookin2.com wrote: Is there any strategy for using OPP with a hash algorithm on the client side to get both uniform distribution of data in the cluster *and* the ability to do range queries? I'm thinking of something like this: cassKey = (key % 97) + "@" + key; cassRange = 0 + "@" + range; 1 + "@" + range; ... 96 + "@" + range; Would something like that work?
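A sketch of what the "union of N range queries" could look like on the client side. rangeQuery() below is a placeholder for whatever single-range call the client library provides (it is not a real Hector or Thrift API), and the final merge step is only needed if a single globally ordered result is required.

// Illustrative sketch of the fan-out above: the same logical range is queried
// once per bucket with the bucket prefix prepended, and the results are merged
// client-side. rangeQuery() is a placeholder, not a real client call.
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class BucketedRangeQuery {

    static final int NUM_BUCKETS = 97;   // must match the bucket count used for writes

    // Placeholder for one ordered range query against Cassandra over [start, end].
    static List<String> rangeQuery(String start, String end) {
        return Collections.emptyList(); // the real client call goes here
    }

    // Logical range query over the un-prefixed keys, fanned out to all buckets.
    static List<String> bucketedRange(String start, String end) {
        List<String> results = new ArrayList<>();
        for (int bucket = 0; bucket < NUM_BUCKETS; bucket++) {
            results.addAll(rangeQuery(bucket + "@" + start, bucket + "@" + end));
        }
        // Keys come back ordered within each bucket; sort by the original key
        // (the part after "@") if one global ordering is needed.
        results.sort(Comparator.comparing(k -> k.substring(k.indexOf('@') + 1)));
        return results;
    }
}

The trade-off is that every logical range query costs N round trips (or one multi-range request, if the client supports batching), which is why N should be large enough to spread load but no larger.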