Re: How to know disk utilization by each row on a node
Did you use cfstats and cfhistograms?

On Jan 22, 2015 12:37 AM, Edson Marquezani Filho edsonmarquez...@gmail.com wrote:
> Ok, nice tool, but I still can't see how much data each row occupies in
> the SSTable (or am I missing something?).
>
> Obs: considering the SSTable format, where rows are strictly sequential
> and sorted, a feature like that doesn't seem very hard to implement,
> anyway. Wouldn't it be possible to calculate it from the index files
> alone, without even needing to read the actual table?
>
> On Tue, Jan 20, 2015 at 5:05 PM, Jens Rantil jens.ran...@tink.se wrote:
>> Hi,
>>
>> DataStax comes with sstablekeys, which does that. You could also use the
>> sstable2json script to find keys.
>>
>> Cheers,
>> Jens
>>
>> On Tue, Jan 20, 2015 at 2:53 PM, Edson Marquezani Filho
>> edsonmarquez...@gmail.com wrote:
>>> Hello, everybody.
>>>
>>> Does anyone know a way to list, for an arbitrary column family, all the
>>> rows owned (including replicas) by a given node, and the data size
>>> (real size or disk occupation) of each one of them on that node?
>>>
>>> I would like to do that because I have data on one of my nodes growing
>>> faster than on the others, although rows (and replicas) seem evenly
>>> distributed across the cluster. So I would like to verify whether I
>>> have some specific rows growing too much.
>>>
>>> Thank you.
Re: How to know disk utilization by each row on a node
Ok, nice tool, but I still can't see how much data each row occupies in the SSTable (or am I missing something?).

Obs: considering the SSTable format, where rows are strictly sequential and sorted, a feature like that doesn't seem very hard to implement, anyway. Wouldn't it be possible to calculate it from the index files alone, without even needing to read the actual table?

On Tue, Jan 20, 2015 at 5:05 PM, Jens Rantil jens.ran...@tink.se wrote:
> Hi,
>
> DataStax comes with sstablekeys, which does that. You could also use the
> sstable2json script to find keys.
>
> Cheers,
> Jens
>
> On Tue, Jan 20, 2015 at 2:53 PM, Edson Marquezani Filho
> edsonmarquez...@gmail.com wrote:
>> Hello, everybody.
>>
>> Does anyone know a way to list, for an arbitrary column family, all the
>> rows owned (including replicas) by a given node, and the data size (real
>> size or disk occupation) of each one of them on that node?
>>
>> I would like to do that because I have data on one of my nodes growing
>> faster than on the others, although rows (and replicas) seem evenly
>> distributed across the cluster. So I would like to verify whether I have
>> some specific rows growing too much.
>>
>> Thank you.
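In the absence of a purpose-built tool, per-row size can at least be roughly approximated by post-processing sstable2json output. A minimal Python sketch, assuming the 2.x output shape (a JSON array with one object per partition and a "key" field); the sample dump is hypothetical, and serialized-JSON length is only a crude proxy for on-disk size (no compression, different encoding), but it is enough to spot a few rows that dwarf the rest:

```python
import json

def estimate_row_sizes(sstable_json):
    """Rank partitions by the serialized length of their sstable2json
    entry -- a rough stand-in for per-row disk usage."""
    rows = json.loads(sstable_json)
    sizes = {}
    for row in rows:
        key = row.get("key")
        # JSON length as a crude size proxy for this partition.
        sizes[key] = len(json.dumps(row))
    return sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical dump of two partitions:
dump = '''[
  {"key": "small", "columns": [["c1", "v1", 1421700000000]]},
  {"key": "huge",  "columns": [["c1", "v1", 1421700000000],
                               ["c2", "a much longer value....", 1421700000001]]}
]'''
print(estimate_row_sizes(dump)[0][0])   # the biggest partition first
```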
Re: Compaction failing to trigger
On Wed, Jan 21, 2015 at 10:10 AM, Flavien Charlon flavien.char...@gmail.com wrote:
>> https://engineering.eventbrite.com/what-version-of-cassandra-should-i-run/
>
> This doesn't really answer my question. I asked whether this particular
> bug (which I can't find in JIRA) is planned to be fixed in 2.1.3, not
> whether 2.1.3 would be production ready.

No idea, but I didn't recognize your name/email, and you were encountering problems with an IMO not-ready-for-production version. Many people who are new to Cassandra and pre- or close-to-production might be better served by running a slightly older version and focusing on the challenge of writing their app against a mostly-working distributed database, instead of troubleshooting Cassandra bugs.

tl;dr - Cassandra bugs in cutting-edge versions are best encountered by experienced operators who can recognize them and respond, not new operators.

> While we're on this topic, the version numbering is very misleading.
> Versions which are not recommended for production should be very
> explicitly labelled as such (beta, for example), and 2.1.0 should really
> be what you now call 2.1.6.

That's why I wrote the blog post. It is, however, important to note that I speak in no official capacity for Apache Cassandra or DataStax. The intent of the project is for x.y.0 to be production ready, and in fairness they have recently added new QA processes which are likely to drive the production-ready version down from x.y.6. They are only human, however, and as developers they likely have slightly different (lower) standards for production readiness than the typical operator. I wrote that blog post to help set operator-appropriate expectations, so people are not disappointed with the overall stability of Cassandra. I personally operate Cassandra slightly on the trailing edge, and as a result only encounter a limited subset of the problems I assist people with on the list and IRC.

=Rob
Re: Compaction failing to trigger
> What version of Cassandra are you running?

2.1.2

> Are they all live? Are there pending compactions, or exceptions regarding
> compactions in your logs?

Yes, they are all live according to cfstats. There are no pending compactions or exceptions in the logs.

> https://engineering.eventbrite.com/what-version-of-cassandra-should-i-run/

This doesn't really answer my question. I asked whether this particular bug (which I can't find in JIRA) is planned to be fixed in 2.1.3, not whether 2.1.3 would be production ready.

While we're on this topic, the version numbering is very misleading. Versions which are not recommended for production should be very explicitly labelled as such (beta, for example), and 2.1.0 should really be what you now call 2.1.6.

> Setting 'cold_reads_to_omit' to 0 did the job for me

Thanks, I've tried it, and it works. This should probably be made the default, IMO.

Flavien

On 20 January 2015 at 22:51, Eric Stevens migh...@gmail.com wrote:
> @Rob - he's probably referring to the thread titled "Reasons for nodes not
> compacting?" where Tyler speculates that the tables are falling below the
> cold read threshold for compaction. He speculated it may be a bug. At the
> same time, in a different thread, Roland had a similar problem, and
> Tyler's proposed workaround seemed to work for him.
>
> On Tue, Jan 20, 2015 at 3:35 PM, Robert Coli rc...@eventbrite.com wrote:
>> On Sun, Jan 18, 2015 at 6:06 PM, Flavien Charlon
>> flavien.char...@gmail.com wrote:
>>> It's set on all the tables, as I'm using the default for all the
>>> tables. But for that particular table there are 41 SSTables between
>>> 60MB and 85MB; it should only take 4 for the compaction to kick in.
>>
>> What version of Cassandra are you running? Are they all live? Are there
>> pending compactions, or exceptions regarding compactions in your logs?
>>
>>> As this is probably a bug and going back in the mailing list archive,
>>> it seems it's already been reported:
>>
>> This is a weird statement. Are you saying that you've found it in the
>> mailing list archives? If so, why not paste the threads so those of us
>> who might remember can refer to them?
>>
>>> - Will it be fixed in 2.1.3?
>>
>> https://engineering.eventbrite.com/what-version-of-cassandra-should-i-run/
>>
>> =Rob
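For context on why zeroing 'cold_reads_to_omit' unblocks compaction, here is a deliberately simplified Python model (not Cassandra's actual code) of a cold-read filter like the STCS subproperty: if a table is rarely read, every sstable looks "cold", nothing is ever a compaction candidate, and small sstables pile up; a threshold of 0 keeps everything eligible. All names and data below are hypothetical.

```python
def hot_candidates(sstables, cold_reads_to_omit):
    """Illustrative model of a cold-read compaction filter: sstables
    whose share of the table's reads falls below the threshold are
    omitted from the candidate list.

    sstables: list of (name, reads_per_sec) tuples -- hypothetical data.
    """
    total = sum(r for _, r in sstables) or 1.0
    return [name for name, r in sstables if r / total >= cold_reads_to_omit]

# A rarely-read table: with a non-zero threshold nothing ever qualifies,
# so the 41 small sstables from the thread above would never compact.
tables = [("sst-%d" % i, 0.0) for i in range(41)]
print(hot_candidates(tables, 0.05))       # prints []
print(len(hot_candidates(tables, 0.0)))   # prints 41
```

The real option's semantics are more subtle (it omits the coldest sstables accounting for up to that fraction of total reads), but the failure mode on an all-cold table is the same.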
Re: How do replica become out of sync
Quite a few, see here: http://pastebin.com/SMnprHdp. In total, about 3,000 ranges across the 3 nodes. This is with vnodes disabled. It was at least an order of magnitude worse when we had it enabled.

Flavien

On 20 January 2015 at 22:22, Robert Coli rc...@eventbrite.com wrote:
> On Mon, Jan 19, 2015 at 5:44 PM, Flavien Charlon
> flavien.char...@gmail.com wrote:
>> Thanks Andi. The reason I was asking is that even though my nodes have
>> been 100% available and no write has been rejected, when running an
>> incremental repair the logs still indicate that some ranges are out of
>> sync (which then results in large amounts of compaction). How can this
>> be possible?
>
> This is most likely, as you conjecture, due to slight differences between
> nodes at the time of Merkle tree calculation. How many rows differ?
>
> =Rob
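Rob's point about Merkle tree calculation can be made concrete with a toy sketch: repair hashes each token range on each replica and compares the hashes, so a single row written between the two tree calculations makes the hashes differ and marks the whole range out of sync, however small the actual difference. This is a simplified model, not Cassandra's implementation; the tokens and values are hypothetical.

```python
import hashlib

def range_hash(rows, lo, hi):
    """Hash all rows whose token falls in [lo, hi) -- a toy stand-in
    for one leaf of repair's Merkle tree."""
    h = hashlib.md5()
    for token in sorted(t for t in rows if lo <= t < hi):
        h.update(repr((token, rows[token])).encode())
    return h.hexdigest()

def out_of_sync_ranges(a, b, boundaries):
    """Compare per-range hashes of two replicas; any mismatch marks the
    entire range for streaming."""
    ranges = list(zip(boundaries, boundaries[1:]))
    return [(lo, hi) for lo, hi in ranges
            if range_hash(a, lo, hi) != range_hash(b, lo, hi)]

node1 = {t: "v" for t in range(0, 100)}
node2 = dict(node1)
node2[37] = "v'"   # one row written between the two tree calculations

# One differing row => the whole 25-token range it lives in is streamed.
print(out_of_sync_ranges(node1, node2, [0, 25, 50, 75, 100]))
```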
Re: Re: Dynamic Columns
The example you provided does not work for my use case.

>     CREATE TABLE t (
>         key blob,
>         my_static_column_1 int static,
>         my_static_column_2 float static,
>         my_static_column_3 blob static,
>         dynamic_column_name blob,
>         dynamic_column_value blob,
>         PRIMARY KEY (key, dynamic_column_name)
>     )

The dynamic column can't be part of the primary key. The temporal entity key can be the default UUID, or the user can choose the field in their object. Within our framework, we have a concept of temporal links between one or more temporal entities. Polluting the primary key with the dynamic column wouldn't work.

Please excuse the confusing RDB comparison. My point is that Cassandra's dynamic column feature is the unique feature that makes it better than a traditional RDB, or NewSQL like VoltDB, for building temporal databases. With databases that require a static schema plus ALTER TABLE for managing schema evolution, it is harder and results in downtime.

One of the challenges of data management over time is evolving the data model while keeping queries simple. If a record is 5 years old, it probably has a different schema than a record inserted this week. With temporal databases every update is an insert, so it's a little more complex than "just use a blob". There's a whole level of complication with temporal data, and CQL3 custom types aren't clear to me. I've read the CQL3 documentation on custom types several times and it is rather poor. It gives me the impression there's still work needed to get custom types in good shape.

With regard to examples others have told me about, your advice is fair. A few minutes with Google and some blogs should pop up. The reason I bring these things up isn't to put down CQL. It's because I care and want to help improve Cassandra by sharing my experience. I consistently recommend new users learn and understand both Thrift and CQL.
On Wed, Jan 21, 2015 at 11:45 AM, Sylvain Lebresne sylv...@datastax.com wrote:
> On Wed, Jan 21, 2015 at 4:44 PM, Peter Lin wool...@gmail.com wrote:
>> I don't remember other people's examples in detail due to my shitty
>> memory, so I'd rather not misquote.
>
> Fair enough, but maybe you shouldn't use people's examples you don't
> remember as arguments then. Those examples might be wrong or outdated,
> and that kind of stuff creates confusion for everyone.
>
>> In my case, I mix static and dynamic columns in a single column family
>> with primitives and objects. The objects are temporal object graphs
>> with a known type. Doing this type of stuff is basically transparent
>> for me, since I'm using Thrift and our data modeler generates helper
>> classes. Our tooling seamlessly converts the bytes back to the target
>> object. We have a few standard static columns related to temporal
>> metadata. At any time, dynamic columns can be added, and they can be
>> primitives or objects.
>
> I don't see anything in that that cannot be done with CQL. You can mix
> static and dynamic columns in CQL thanks to static columns. More
> precisely, you can do what you're describing with a table looking a bit
> like this:
>
>     CREATE TABLE t (
>         key blob,
>         my_static_column_1 int static,
>         my_static_column_2 float static,
>         my_static_column_3 blob static,
>         dynamic_column_name blob,
>         dynamic_column_value blob,
>         PRIMARY KEY (key, dynamic_column_name)
>     )
>
> And your helper classes will serialize your objects as they probably do
> today (if you use a custom comparator, you can do that too). And let it
> be clear that I'm not pretending that doing it this way is tremendously
> simpler than Thrift. But I'm saying that 1) it's possible and 2) while
> it's not meaningfully simpler than Thrift, it's not really harder either
> (and in fact, it's actually less verbose with CQL than with raw Thrift).
>
>> For the record, doing this kind of stuff in a relational database sucks
>> horribly.
>
> I don't know what that has to do with CQL, to be honest. If you're doing
> relational with CQL, you're doing it wrong.
>
> And please note that I'm not saying CQL is the perfect API for modeling
> temporal data. But I don't get how Thrift, which is a very crude API, is
> a much better API for that than CQL (or, again, how it allows you to do
> things you can't with CQL).
>
> --
> Sylvain
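A toy in-memory model may make Sylvain's sketch concrete: treating the dynamic column name as part of the clustering key means each (partition key, column name) pair is its own CQL row, so new "columns" can be added at any time with no schema change. This is a Python illustration of the idea, not Cassandra code; all names are hypothetical.

```python
class DynamicColumnTable:
    """Toy model of the CQL table sketched above: one CQL row per
    (partition key, dynamic column name) pair, so new 'columns' appear
    at write time with no ALTER TABLE."""
    def __init__(self):
        self.cells = {}
    def put(self, key, column_name, value):
        # INSERT INTO t (key, dynamic_column_name, dynamic_column_value)
        self.cells[(key, column_name)] = value
    def row(self, key):
        # SELECT * FROM t WHERE key = ? : every dynamic column of the
        # partition comes back as one logical row.
        return {c: v for (k, c), v in self.cells.items() if k == key}

t = DynamicColumnTable()
t.put(b"entity-1", b"name", b"Ada")
t.put(b"entity-1", b"born", b"1815")   # brand-new column, no schema change
print(sorted(t.row(b"entity-1")))
```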
Re: Re: Dynamic Columns
On Wed, Jan 21, 2015 at 4:44 PM, Peter Lin wool...@gmail.com wrote:
> I don't remember other people's examples in detail due to my shitty
> memory, so I'd rather not misquote.

Fair enough, but maybe you shouldn't use people's examples you don't remember as arguments then. Those examples might be wrong or outdated, and that kind of stuff creates confusion for everyone.

> In my case, I mix static and dynamic columns in a single column family
> with primitives and objects. The objects are temporal object graphs with
> a known type. Doing this type of stuff is basically transparent for me,
> since I'm using Thrift and our data modeler generates helper classes. Our
> tooling seamlessly converts the bytes back to the target object. We have
> a few standard static columns related to temporal metadata. At any time,
> dynamic columns can be added, and they can be primitives or objects.

I don't see anything in that that cannot be done with CQL. You can mix static and dynamic columns in CQL thanks to static columns. More precisely, you can do what you're describing with a table looking a bit like this:

    CREATE TABLE t (
        key blob,
        my_static_column_1 int static,
        my_static_column_2 float static,
        my_static_column_3 blob static,
        dynamic_column_name blob,
        dynamic_column_value blob,
        PRIMARY KEY (key, dynamic_column_name)
    )

And your helper classes will serialize your objects as they probably do today (if you use a custom comparator, you can do that too). And let it be clear that I'm not pretending that doing it this way is tremendously simpler than Thrift. But I'm saying that 1) it's possible and 2) while it's not meaningfully simpler than Thrift, it's not really harder either (and in fact, it's actually less verbose with CQL than with raw Thrift).

> For the record, doing this kind of stuff in a relational database sucks
> horribly.

I don't know what that has to do with CQL, to be honest. If you're doing relational with CQL, you're doing it wrong.

And please note that I'm not saying CQL is the perfect API for modeling temporal data. But I don't get how Thrift, which is a very crude API, is a much better API for that than CQL (or, again, how it allows you to do things you can't with CQL).

--
Sylvain
Re: Re: Dynamic Columns
I don't remember other people's examples in detail due to my shitty memory, so I'd rather not misquote.

In my case, I mix static and dynamic columns in a single column family with primitives and objects. The objects are temporal object graphs with a known type. Doing this type of stuff is basically transparent for me, since I'm using Thrift and our data modeler generates helper classes. Our tooling seamlessly converts the bytes back to the target object. We have a few standard static columns related to temporal metadata. At any time, dynamic columns can be added, and they can be primitives or objects.

The framework we built uses CQL for basic queries and views the user defines. We model the schema in a GUI modeler, and the framework provides a query API to access a specific version, or versions, of any record. The design borrows heavily from temporal logic and active databases.

For the record, doing this kind of stuff in a relational database sucks horribly. The reason I chose to build a temporal database on Cassandra is that I've done it on Oracle/SQL Server in the past. Last year I submitted a talk about our temporal database for the DataStax conference, but it was rejected since there were too many submissions. I know Spotify also built a temporal database on Cassandra, and they gave a talk on what they did.

peter

On Wed, Jan 21, 2015 at 10:13 AM, Sylvain Lebresne sylv...@datastax.com wrote:
>> I've chatted with several long time users of Cassandra and there's
>> things CQL3 doesn't support.
>
> Would you care to elaborate then? Maybe a simple example of something (or
> multiple things, since you used the plural) in Thrift that cannot be
> supported in CQL?
>
> And please note that I'm *not* saying that all existing Thrift tables can
> be seamlessly used from CQL: there are indeed a few cases for which
> that's not the case. But that does not mean those cases cannot easily be
> done in CQL from scratch.
Re: get partition key from tombstone warnings?
There is an open ticket for this improvement at https://issues.apache.org/jira/browse/CASSANDRA-8561

On Wed, Jan 21, 2015 at 4:55 PM, Ian Rose ianr...@fullstory.com wrote:
> When I see a warning like "Read 9 live and 5769 tombstoned cells in ..."
> etc., is there a way for me to see the partition key that this query was
> operating on? The description in the original JIRA ticket
> (https://issues.apache.org/jira/browse/CASSANDRA-6042) reads as though
> exposing this information was one of the original goals, but it isn't
> obvious to me in the logs...
>
> Cheers!
> - Ian
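Until that ticket lands, the warning itself can at least be scraped for the table name and tombstone counts to find the worst offenders. A hedged Python sketch; the exact log line shape is an assumption based on the message quoted above, and the partition key is genuinely absent from it (which is what CASSANDRA-8561 asks for):

```python
import re

# Assumed shape of the 2.x tombstone warning; table name and counts are
# present, the partition key is not.
PATTERN = re.compile(r"Read (\d+) live and (\d+) tombstoned cells in (\S+)")

def worst_tombstone_offenders(log_lines):
    """Return (table, max tombstones seen) pairs, worst first."""
    worst = {}
    for line in log_lines:
        m = PATTERN.search(line)
        if m:
            dead, table = int(m.group(2)), m.group(3)
            worst[table] = max(worst.get(table, 0), dead)
    return sorted(worst.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical log excerpts:
logs = [
    "WARN ... Read 9 live and 5769 tombstoned cells in ks.queue ...",
    "WARN ... Read 100 live and 12 tombstoned cells in ks.users ...",
]
print(worst_tombstone_offenders(logs)[0])
```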
Re: Re: Dynamic Columns
On Wed, Jan 21, 2015 at 6:19 PM, Peter Lin wool...@gmail.com wrote:
> the dynamic column can't be part of the primary key. The temporal entity
> key can be the default UUID, or the user can choose the field in their
> object. Within our framework, we have a concept of temporal links between
> one or more temporal entities. Polluting the primary key with the dynamic
> column wouldn't work.

Not totally sure I understand. Are you talking about the underlying storage space used? If you are, we can discuss it (it's not too hard to remedy in CQL; I was mainly trying to illustrate my point, not pretending this was a drop-in solution for your use case), but that's more of a performance discussion, and I think we've somewhat quit the realm of "there's things CQL3 doesn't support".

> Please excuse the confusing RDB comparison. My point is that Cassandra's
> dynamic column feature is the unique feature that makes it better than a
> traditional RDB, or NewSQL like VoltDB, for building temporal databases.
> With databases that require a static schema plus ALTER TABLE for managing
> schema evolution, it is harder and results in downtime.

Here again, you seem to imply that CQL doesn't support dynamic columns, or has somewhat inferior support, but that's just not true.

> One of the challenges of data management over time is evolving the data
> model while keeping queries simple. If a record is 5 years old, it
> probably has a different schema than a record inserted this week. With
> temporal databases every update is an insert, so it's a little more
> complex than "just use a blob". There's a whole level of complication
> with temporal data, and CQL3 custom types aren't clear to me. I've read
> the CQL3 documentation on custom types several times and it is rather
> poor. It gives me the impression there's still work needed to get custom
> types in good shape.

I'm sorry, but that's a bit of hand waving. Custom types (and by that I mean user-provided AbstractType implementations) work in CQL *exactly* as in Thrift: they are not in better or worse shape than in Thrift. And while the CQL3 documentation is indeed poor on this part, so is the Thrift documentation on the same subject (besides, I don't think your whole point is that documentation could be improved). Again, what you can do in Thrift, you can do in CQL.

> I consistently recommend new users learn and understand both Thrift and
> CQL.

I understand that you do this with the best of intentions, and don't take it the wrong way, but it is my opinion that you are being counterproductive by doing so, for 2 reasons:

1) You don't only recommend users learn both APIs; you justify that advice by affirming that there is a whole family of important use cases that Thrift supports and CQL does not. Except that I claim this affirmation is technically incorrect, and so far I haven't seen much example proving me wrong.

2) There is a wealth of evidence that trying to learn both Thrift and CQL confuses the hell out of new users. Which is, btw, not surprising: both APIs present the same concepts in seemingly different ways (even though they are the same concepts) and even have conflicting vocabulary, so it's obviously confusing when you try to learn those concepts in the first place. Learning CQL when you know Thrift well is fine, and why not learn Thrift once you know and understand CQL well, but learning both at once is imo bad advice. It could maybe (maybe) be justified if what you say about a whole family of use cases not being doable with CQL were true, but it's not.

--
Sylvain
get partition key from tombstone warnings?
When I see a warning like "Read 9 live and 5769 tombstoned cells in ..." etc., is there a way for me to see the partition key that this query was operating on? The description in the original JIRA ticket (https://issues.apache.org/jira/browse/CASSANDRA-6042) reads as though exposing this information was one of the original goals, but it isn't obvious to me in the logs...

Cheers!
- Ian
Re: Re: Dynamic Columns
I apologize if I've offended you, but I clearly stated that CQL3 supports dynamic columns. How it supports dynamic columns is different. If I'm reading you correctly, I believe we agree that both Thrift and CQL3 support dynamic columns. Where we differ is that I feel the coverage of existing Thrift use cases isn't 100%. That may be right or wrong, but it is my impression. I agree with you that CQL3 supports the majority of dynamic column use cases, but in a slightly different way. There are cases like mine which fit better in Thrift.

Could I rip out all the stuff I did and replace it with CQL3 in a major redesign? Yes, I could, but honestly I see some downsides with that proposition:

1. For modeling tools like mine, an object API is a far better fit, in my biased opinion.

2. Text-based languages like SQL and CQL could in theory provide similar object safety, but it's so much work that most people don't bother. This is from first-hand experience building 3 ORMs and using most of the open source ORMs in the Java space. I've also used several ORMs in .NET, and they all suffer from this pain point. There's a reason why Microsoft created LINQ.

3. The structure and syntax of SQL, and all variations of SQL, are not ideally suited to complex data structures that are graphs. A temporal entity is an object graph that may be shallow (3-8 levels) or deep (15+). SQL is ideally suited to tables. CQL in this regard is more flexible and supports collections, but it's still not ideal for things like insurance policies. Look at the ACORD standard for property insurance if you want a better understanding. For example, a temporal record using an ORM could result in anywhere from 500 rows of data across a dozen tables for a small entity to 50K+ rows for a large entity.

The mailing list isn't the right place to go into the theory and practice of temporal databases, but a lot of the design choices I made are based on formal logic.

On Wed, Jan 21, 2015 at 4:06 PM, Sylvain Lebresne sylv...@datastax.com wrote:
> On Wed, Jan 21, 2015 at 6:19 PM, Peter Lin wool...@gmail.com wrote:
>> the dynamic column can't be part of the primary key. The temporal entity
>> key can be the default UUID, or the user can choose the field in their
>> object. Within our framework, we have a concept of temporal links
>> between one or more temporal entities. Polluting the primary key with
>> the dynamic column wouldn't work.
>
> Not totally sure I understand. Are you talking about the underlying
> storage space used? If you are, we can discuss it (it's not too hard to
> remedy in CQL; I was mainly trying to illustrate my point, not pretending
> this was a drop-in solution for your use case), but that's more of a
> performance discussion, and I think we've somewhat quit the realm of
> "there's things CQL3 doesn't support".
>
>> Please excuse the confusing RDB comparison. My point is that Cassandra's
>> dynamic column feature is the unique feature that makes it better than a
>> traditional RDB, or NewSQL like VoltDB, for building temporal databases.
>> With databases that require a static schema plus ALTER TABLE for
>> managing schema evolution, it is harder and results in downtime.
>
> Here again, you seem to imply that CQL doesn't support dynamic columns,
> or has somewhat inferior support, but that's just not true.
>
>> One of the challenges of data management over time is evolving the data
>> model while keeping queries simple. If a record is 5 years old, it
>> probably has a different schema than a record inserted this week. With
>> temporal databases every update is an insert, so it's a little more
>> complex than "just use a blob". There's a whole level of complication
>> with temporal data, and CQL3 custom types aren't clear to me. I've read
>> the CQL3 documentation on custom types several times and it is rather
>> poor. It gives me the impression there's still work needed to get custom
>> types in good shape.
>
> I'm sorry, but that's a bit of hand waving. Custom types (and by that I
> mean user-provided AbstractType implementations) work in CQL *exactly* as
> in Thrift: they are not in better or worse shape than in Thrift. And
> while the CQL3 documentation is indeed poor on this part, so is the
> Thrift documentation on the same subject (besides, I don't think your
> whole point is that documentation could be improved). Again, what you can
> do in Thrift, you can do in CQL.

Honestly, I haven't tried to use CQL3 user-provided types. I read the specification several times and had a ton of questions, along with several other people who were trying to understand what it meant. If you want people to use it, the documentation needs to improve. I did give a good faith effort and spent a week trying to understand what the spec is trying to say, but it only resulted in more questions. So yes, I am hand waving, because it left me frustrated. Having been part of the Apache community for many years, writing great docs is hard and most of us hate doing it. Just to be clear, I'm not blaming anyone for poor docs. I'm just
Re: Re: Dynamic Columns
On Wed, Jan 21, 2015 at 9:19 AM, Peter Lin wool...@gmail.com wrote:
> I consistently recommend new users learn and understand both Thrift and
> CQL.

FWIW, I consider this a disservice to new users. New users should use CQL, and not deploy against a deprecated-in-all-but-name API. Understanding non-CQL *storage* might be valuable; understanding the Thrift interface to storage is anti-valuable.

Despite the dissembling public statements regarding Thrift "not going anywhere", it is obvious to me that no other databases exist with two non-pluggable and incompatible APIs, for a reason. The pain of maintaining these two APIs will eventually become not worth the backwards compatibility. At that time it will be deprecated, and then shortly thereafter removed; I expect this to happen by EOY 2018 at the latest. [1]

=Rob

[1] If anyone strongly disagrees, I am taking $20 cash bets, with any proceeds donated to the Apache Foundation.
Re: Re: Dynamic Columns
Everyone is different. I also recommend users take time to understand every tool they use, as much as time allows. We don't always have the luxury of time, but I see no point recommending laziness. I'm probably insane, since I also spend time reading papers on CRDTs, Paxos, query compilers, machine learning, and other topics I find fun.

On the topic of multiple incompatible APIs, I recommend you look at SQL Server and Sybase. Most of the legacy RDBMSes have multiple incompatible APIs. Though in some cases, it is/was unavoidable.

On Wed, Jan 21, 2015 at 4:47 PM, Robert Coli rc...@eventbrite.com wrote:
> On Wed, Jan 21, 2015 at 9:19 AM, Peter Lin wool...@gmail.com wrote:
>> I consistently recommend new users learn and understand both Thrift and
>> CQL.
>
> FWIW, I consider this a disservice to new users. New users should use
> CQL, and not deploy against a deprecated-in-all-but-name API.
> Understanding non-CQL *storage* might be valuable; understanding the
> Thrift interface to storage is anti-valuable.
>
> Despite the dissembling public statements regarding Thrift "not going
> anywhere", it is obvious to me that no other databases exist with two
> non-pluggable and incompatible APIs, for a reason. The pain of
> maintaining these two APIs will eventually become not worth the backwards
> compatibility. At that time it will be deprecated, and then shortly
> thereafter removed; I expect this to happen by EOY 2018 at the latest. [1]
>
> =Rob
>
> [1] If anyone strongly disagrees, I am taking $20 cash bets, with any
> proceeds donated to the Apache Foundation.
Re: Re: Dynamic Columns
I've written my fair share of crappy code which became legacy; then I, or someone else, was left with supporting it and something newer. Isn't that the nature of software development? I forget who said this quote first, but I'm gonna borrow it: "the only pretty code is code that is in your head. Once it's written, it becomes crap." I tell my son this all the time. When we start a project, we have no clue what we should have known, so we make a buttload of mistakes. If we're lucky, by the third or fourth version it's not so smelly, but in the meantime we have to keep supporting the stuff. Not because we want to, but because we're the ones that put the users through it. At least that's how I see it.

Having said that, at some point the really old stuff should be deprecated and cleaned out. It totally makes sense to remove Thrift at some point. I don't know when that is, but every piece of software eventually dies or is abandoned. Except for COBOL. That thing will be around 200 years from now.

On Wed, Jan 21, 2015 at 6:57 PM, Robert Coli rc...@eventbrite.com wrote:
> On Wed, Jan 21, 2015 at 2:09 PM, Peter Lin wool...@gmail.com wrote:
>> on the topic of multiple incompatible APIs, I recommend you look at SQL
>> Server and Sybase. Most of the legacy RDBMSes have multiple incompatible
>> APIs. Though in some cases, it is/was unavoidable.
>
> My bet is that the small development team responsible for Cassandra does
> not have anything like the number of contractual obligations that
> commercial databases from the 1980s had. In other words, I believe having
> two persistent, non-pluggable (this attribute probably excludes various
> legacy APIs?) APIs is far more avoidable in the Cassandra case than in
> the historic cases you cite.
>
> I could certainly be wrong... people who disagree with my assessment now
> have a way to make me pay for my wrongness, by making me donate $20 to
> the Apache Foundation on Jan 1, 2019. [1] :D
>
> =Rob
>
> [1] Project committers/others with material ability (DataStax...) to
> affect outcome ineligible.
Re: Re: Dynamic Columns
On Wed, Jan 21, 2015 at 2:09 PM, Peter Lin wool...@gmail.com wrote:
> on the topic of multiple incompatible APIs, I recommend you look at SQL
> Server and Sybase. Most of the legacy RDBMSes have multiple incompatible
> APIs. Though in some cases, it is/was unavoidable.

My bet is that the small development team responsible for Cassandra does not have anything like the number of contractual obligations that commercial databases from the 1980s had. In other words, I believe having two persistent, non-pluggable (this attribute probably excludes various legacy APIs?) APIs is far more avoidable in the Cassandra case than in the historic cases you cite.

I could certainly be wrong... people who disagree with my assessment now have a way to make me pay for my wrongness, by making me donate $20 to the Apache Foundation on Jan 1, 2019. [1] :D

=Rob

[1] Project committers/others with material ability (DataStax...) to affect outcome ineligible.
Re: Re: Dynamic Columns
Peter, At least from your description, the proposed use of the clustering column name seems at first blush to fully fit the bill. The point is not that the resulting clustered primary key is used to reference an object, but that a SELECT on the partition key references the entire object, which will be a sequence of CQL3 rows in a partition, and the clustering column key is then added when you wish to access one specific aspect of the object. What's missing? Again, just store the partition key to reference the full object - no pollution required! And please note that any number of clustering columns can be specified, so more structured dynamic columns can be supported. For example, you could have a timestamp as a separate clustering column to maintain the temporal state of the database. The partition key can also be structured from multiple columns, as a composite partition key. As for all those static columns, consider them optional and merely an optimization. If you wish to have a 100% opaque object model, you wouldn't have any static columns, and the only non-primary-key column would be the blob value field. Every object attribute would be specified using another clustering column name and blob value. Presto - everything you need for a pure, opaque, fully-generalized object management system, all with just CQL3. Maybe we should include such an example in the doc and with the project, to more strongly emphasize this capability to fully model arbitrarily complex object structures - including temporal structures. Anything else missing? As a general proposition, you can use the term clustering column in CQL3 wherever you might have used dynamic column in Thrift. The point in CQL3 is not to eliminate a useful feature, dynamic columns, but to repackage the feature so it makes a lot more sense for the vast majority of use cases.
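Jack's fully opaque variant can be sketched in CQL3 (table and column names here are illustrative, not from the thread):

```cql
-- No static columns at all: every object attribute is one CQL3 row,
-- keyed by a clustering column that plays the "dynamic column name" role.
CREATE TABLE objects (
    object_id  blob,     -- partition key: references the entire object
    attr_name  text,     -- clustering column: the dynamic attribute name
    attr_value blob,     -- opaque attribute value
    PRIMARY KEY (object_id, attr_name)
);

-- The whole object (a sequence of CQL3 rows in one partition):
SELECT attr_name, attr_value FROM objects WHERE object_id = 0x01;

-- One specific aspect of the object:
SELECT attr_value FROM objects WHERE object_id = 0x01 AND attr_name = 'color';
```

Adding a timestamp as a second clustering column, PRIMARY KEY (object_id, attr_name, updated_at), would give the temporal dimension Jack mentions.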
Maybe there are some cases that don't fit as well as desired, but feel free to specifically identify such cases so that we can elaborate on how we think they are covered, or at least covered well enough for most users. -- Jack Krupansky On Wed, Jan 21, 2015 at 12:19 PM, Peter Lin wool...@gmail.com wrote: the example you provided does not work for my use case. CREATE TABLE t ( key blob, my_static_column_1 int static, my_static_column_2 float static, my_static_column_3 blob static, dynamic_column_name blob, dynamic_column_value blob, PRIMARY KEY (key, dynamic_column_name) ); The dynamic column can't be part of the primary key. The temporal entity key can be the default UUID or the user can choose a field in their object. Within our framework, we have the concept of temporal links between one or more temporal entities. Polluting the primary key with the dynamic column wouldn't work. Please excuse the confusing RDB comparison. My point is that Cassandra's dynamic column feature is the unique feature that makes it better than a traditional RDB, or NewSQL like VoltDB, for building temporal databases. With databases that require a static schema plus ALTER TABLE for managing schema evolution, it is harder and results in downtime. One of the challenges of data management over time is evolving the data model while keeping queries simple. If a record is 5 years old, it probably has a different schema than a record inserted this week. With temporal databases, every update is an insert, so it's a little bit more complex than just using a blob. There's a whole level of complication with temporal data, and how CQL3 custom types address it isn't clear to me. I've read the CQL3 documentation on custom types several times and it is rather poor. It gives me the impression there's still work needed to get custom types into good shape. With regard to the examples others have told me about, your advice is fair. A few minutes with Google and some blogs should pop up.
The reason I bring these things up isn't to put down CQL. It's because I care and want to help improve Cassandra by sharing my experience. I consistently recommend new users learn and understand both Thrift and CQL. On Wed, Jan 21, 2015 at 11:45 AM, Sylvain Lebresne sylv...@datastax.com wrote: On Wed, Jan 21, 2015 at 4:44 PM, Peter Lin wool...@gmail.com wrote: I don't remember other people's examples in detail due to my shitty memory, so I'd rather not misquote. Fair enough, but maybe you shouldn't use people's examples you don't remember as arguments then. Those examples might be wrong or outdated, and that kind of stuff creates confusion for everyone. In my case, I mix static and dynamic columns in a single column family, with primitives and objects. The objects are temporal object graphs with a known type. Doing this type of stuff is basically transparent for me, since I'm using Thrift and our data modeler generates helper classes. Our tooling seamlessly converts the bytes back to the target object. We have a few standard static columns related to
Re: Is there a way to add a new node to a cluster but not sync old data?
Thanks for the reply. The bootstrap of the new node put a heavy burden on the whole cluster and I don't know why. So that's the issue I want to fix, actually. On Mon, Jan 12, 2015 at 6:08 AM, Eric Stevens migh...@gmail.com wrote: Yes, but it won't do what I suspect you're hoping for. If you disable auto_bootstrap in cassandra.yaml the node will join the cluster and will not stream any old data from existing nodes. The cluster will now be in an inconsistent state. If you bring enough nodes online this way to violate your read consistency level (e.g. RF=3, CL=QUORUM, if you bring on 2 nodes this way), some of your queries will be missing data that they ought to have returned. There is no way to bring a new node online and have it be responsible just for new data, and have no responsibility for old data. It *will* be responsible for old data, it just won't *know* about the old data it should be responsible for. Executing a repair will fix this, but only because the existing nodes will stream all the missing data to the new node. This will create more pressure on your cluster than just normal bootstrapping would have. I can't think of any reason you'd want to do that unless you needed to grow your cluster really quickly, and were ok with corrupting your old data. On Sat, Jan 10, 2015 at 12:39 AM, Yatong Zhang bluefl...@gmail.com wrote: Hi there, I am using C* 2.0.10 and I was trying to add a new node to a cluster (actually to replace a dead node). But after adding the new node, some other nodes in the cluster had a very high workload and it affected the whole performance of the cluster. So I am wondering, is there a way to add a new node so that this node only holds new data?
Fwd: ReadTimeoutException in Cassandra 2.0.11
Hello All, I am trying to process a 200MB file and I am getting the following error. We are using apache-cassandra-2.0.3.jar. com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra timeout during read query at consistency ONE (1 responses were required but only 0 replica responded) 1. Is it due to memory? 2. Is it related to the driver? Initially, when I was trying a 15MB file, it threw the same exception, but after that it started working. thanks regards neha
Re: Is there a way to add a new node to a cluster but not sync old data?
Yes, bootstrapping a new node will cause read loads on your existing nodes - it is becoming the owner and replica of a whole new set of existing data. To do that it needs to know what data it's now responsible for, and that's what bootstrapping is for. If you're at the point where bootstrapping a new node is placing a too-heavy burden on your existing nodes, you may be dangerously close to or even past the tipping point where you ought to have already grown your cluster. You need to grow your cluster as soon as possible, and chances are you're close to no longer being able to keep up with compaction (see nodetool compactionstats, make sure pending tasks is < 5, preferably 0 or 1). Once you're falling behind on compaction, it becomes difficult to successfully bootstrap new nodes, and you're in a very tough spot. On Wed, Jan 21, 2015 at 7:43 PM, Yatong Zhang bluefl...@gmail.com wrote: Thanks for the reply. The bootstrap of the new node put a heavy burden on the whole cluster and I don't know why. So that's the issue I want to fix, actually. On Mon, Jan 12, 2015 at 6:08 AM, Eric Stevens migh...@gmail.com wrote: Yes, but it won't do what I suspect you're hoping for. If you disable auto_bootstrap in cassandra.yaml the node will join the cluster and will not stream any old data from existing nodes. The cluster will now be in an inconsistent state. If you bring enough nodes online this way to violate your read consistency level (e.g. RF=3, CL=QUORUM, if you bring on 2 nodes this way), some of your queries will be missing data that they ought to have returned. There is no way to bring a new node online and have it be responsible just for new data, and have no responsibility for old data. It *will* be responsible for old data, it just won't *know* about the old data it should be responsible for. Executing a repair will fix this, but only because the existing nodes will stream all the missing data to the new node.
This will create more pressure on your cluster than just normal bootstrapping would have. I can't think of any reason you'd want to do that unless you needed to grow your cluster really quickly, and were ok with corrupting your old data. On Sat, Jan 10, 2015 at 12:39 AM, Yatong Zhang bluefl...@gmail.com wrote: Hi there, I am using C* 2.0.10 and I was trying to add a new node to a cluster (actually to replace a dead node). But after adding the new node, some other nodes in the cluster had a very high workload and it affected the whole performance of the cluster. So I am wondering, is there a way to add a new node so that this node only holds new data?
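Eric's "are you keeping up with compaction" check can be sketched in shell. The nodetool output line is simulated here; against a live node you would parse nodetool compactionstats itself (the "pending tasks: N" line format is an assumption based on 2.0-era output):

```shell
# Simulated `nodetool compactionstats` output; on a live node use:
#   pending=$(nodetool compactionstats | awk -F': ' '/pending tasks/ {print $2}')
sample_output="pending tasks: 7"

# Extract the pending-task count from the "pending tasks: N" line.
pending=$(echo "$sample_output" | awk -F': ' '/pending tasks/ {print $2}')

# Eric's rule of thumb: pending should be < 5, preferably 0 or 1.
if [ "$pending" -gt 5 ]; then
  echo "WARNING: falling behind on compaction ($pending pending tasks)"
fi
```

On a healthy cluster the check prints nothing; a persistent warning here is the signal that bootstrapping a new node will be painful.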
Re: Is there a way to add a new node to a cluster but not sync old data?
Yes, my cluster is almost full and there are lots of pending tasks. You helped me a lot, thank you Eric~ On Thu, Jan 22, 2015 at 11:59 AM, Eric Stevens migh...@gmail.com wrote: Yes, bootstrapping a new node will cause read loads on your existing nodes - it is becoming the owner and replica of a whole new set of existing data. To do that it needs to know what data it's now responsible for, and that's what bootstrapping is for. If you're at the point where bootstrapping a new node is placing a too-heavy burden on your existing nodes, you may be dangerously close to or even past the tipping point where you ought to have already grown your cluster. You need to grow your cluster as soon as possible, and chances are you're close to no longer being able to keep up with compaction (see nodetool compactionstats, make sure pending tasks is < 5, preferably 0 or 1). Once you're falling behind on compaction, it becomes difficult to successfully bootstrap new nodes, and you're in a very tough spot. On Wed, Jan 21, 2015 at 7:43 PM, Yatong Zhang bluefl...@gmail.com wrote: Thanks for the reply. The bootstrap of the new node put a heavy burden on the whole cluster and I don't know why. So that's the issue I want to fix, actually. On Mon, Jan 12, 2015 at 6:08 AM, Eric Stevens migh...@gmail.com wrote: Yes, but it won't do what I suspect you're hoping for. If you disable auto_bootstrap in cassandra.yaml the node will join the cluster and will not stream any old data from existing nodes. The cluster will now be in an inconsistent state. If you bring enough nodes online this way to violate your read consistency level (e.g. RF=3, CL=QUORUM, if you bring on 2 nodes this way), some of your queries will be missing data that they ought to have returned. There is no way to bring a new node online and have it be responsible just for new data, and have no responsibility for old data. It *will* be responsible for old data, it just won't *know* about the old data it should be responsible for.
Executing a repair will fix this, but only because the existing nodes will stream all the missing data to the new node. This will create more pressure on your cluster than just normal bootstrapping would have. I can't think of any reason you'd want to do that unless you needed to grow your cluster really quickly, and were ok with corrupting your old data. On Sat, Jan 10, 2015 at 12:39 AM, Yatong Zhang bluefl...@gmail.com wrote: Hi there, I am using C* 2.0.10 and I was trying to add a new node to a cluster (actually to replace a dead node). But after adding the new node, some other nodes in the cluster had a very high workload and it affected the whole performance of the cluster. So I am wondering, is there a way to add a new node so that this node only holds new data?
Re: get partition key from tombstone warnings?
Ah, thanks for the pointer Philip. Is there any kind of formal way to vote up issues? I'm assuming that adding a comment of +1 or the like is more likely to be *counter*productive. - Ian On Wed, Jan 21, 2015 at 5:02 PM, Philip Thompson philip.thomp...@datastax.com wrote: There is an open ticket for this improvement at https://issues.apache.org/jira/browse/CASSANDRA-8561 On Wed, Jan 21, 2015 at 4:55 PM, Ian Rose ianr...@fullstory.com wrote: When I see a warning like Read 9 live and 5769 tombstoned cells in ... etc is there a way for me to see the partition key that this query was operating on? The description in the original JIRA ticket ( https://issues.apache.org/jira/browse/CASSANDRA-6042) reads as though exposing this information was one of the original goals, but it isn't obvious to me in the logs... Cheers! - Ian
Re: Versioning in cassandra while indexing ?
Depending on your data model, a static column might be useful: https://issues.apache.org/jira/plugins/servlet/mobile#issue/CASSANDRA-6561 On Jan 21, 2015 2:56 AM, Pandian R pandian4m...@gmail.com wrote: Hi, I just wanted to know if there is any kind of versioning system in Cassandra while indexing new data (like the one we have in Elasticsearch, for example). For example, I have a series of payloads, each coming with an id and an 'updatedAt' timestamp. I just want to maintain the latest state of any payload for all the ids, i.e. index the data only if the current payload has a greater 'updatedAt' than the previously stored timestamp. I can do this with one additional self-lookup, but is there a way to achieve this without the overhead of an additional lookup? Thanks! -- Regards, Pandian
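For reference, the CASSANDRA-6561 feature is the static column: one value shared by all rows of a partition, which could hold the newest 'updatedAt' per id. A sketch with illustrative table and column names:

```cql
CREATE TABLE payload_versions (
    id          bigint,
    latest_seen timestamp static,   -- one shared copy per partition (per id)
    updated_at  timestamp,
    body        blob,
    PRIMARY KEY (id, updated_at)
) WITH CLUSTERING ORDER BY (updated_at DESC);

-- Newest version of a payload comes first in clustering order:
SELECT body FROM payload_versions WHERE id = 42 LIMIT 1;
```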
Re: Versioning in cassandra while indexing ?
I believe you can use “USING TIMESTAMP XXX” with your inserts, which will set the actual cell write times to the timestamp you provide. Then at least on read you’ll get the “latest” value… you may or may not incur an actual write of the old data to disk, but either way it’ll get cleaned up for you. On Jan 21, 2015, at 1:54 AM, Pandian R pandian4m...@gmail.com wrote: Hi, I just wanted to know if there is any kind of versioning system in Cassandra while indexing new data (like the one we have in Elasticsearch, for example). For example, I have a series of payloads, each coming with an id and an 'updatedAt' timestamp. I just want to maintain the latest state of any payload for all the ids, i.e. index the data only if the current payload has a greater 'updatedAt' than the previously stored timestamp. I can do this with one additional self-lookup, but is there a way to achieve this without the overhead of an additional lookup? Thanks! -- Regards, Pandian
Re: Versioning in cassandra while indexing ?
Awesome. Thanks a lot Graham. Will use the clock timestamp for versioning :) On Wed, Jan 21, 2015 at 2:02 PM, graham sanderson gra...@vast.com wrote: I believe you can use “USING TIMESTAMP XXX” with your inserts which will set the actual cell write times to the timestamp you provide. Then at least on read you’ll get the “latest” value… you may or may not incur an actual write of the old data to disk, but either way it’ll get cleaned up for you. On Jan 21, 2015, at 1:54 AM, Pandian R pandian4m...@gmail.com wrote: Hi, I just wanted to know if there is any kind of versioning system in cassandra while indexing new data(like the one we have for ElasticSearch, for example). For example, I have a series of payloads each coming with an id and 'updatedAt' timestamp. I just want to maintain the latest state of any payload for all the ids ie, index the data only if the current payload has greater 'updatedAt' than the previously stored timestamp. I can do this with one additional self-lookup, but is there a way to achieve this without overhead of additional lookup ? Thanks ! -- Regards, Pandian -- Regards, Pandian
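Graham's suggestion can be sketched as follows (table name and values are illustrative, not from the thread): derive the cell timestamp from the payload's own 'updatedAt', so a stale insert can never shadow a newer value on read.

```cql
-- Cell timestamps are in microseconds; here 1421888400000000 stands in for
-- the payload's updatedAt. If a later insert carries a smaller timestamp,
-- the existing (newer) value keeps winning at read time.
INSERT INTO payloads (id, body)
VALUES (42, 'latest state')
USING TIMESTAMP 1421888400000000;
```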
Re: Re: Dynamic Columns
Hello, Peter highlighted the tradeoff between Thrift and CQL3 nicely in this case, i.e. requiring a different design approach for this solution. Collections do not sound like a good fit for your current challenge, but is there a different way to design/solve your challenge using CQL techniques? It is recommended to leverage CQL for new projects, as this is the direction that Cassandra is heading and where the majority of effort is being applied from a development perspective. Sounds like you have a decision to make: leverage Thrift and the dynamic-column approach to solving this problem, or rethink the design approach and leverage CQL. Please let the mailing list know the direction you choose. Jonathan Lacefield Solution Architect | (404) 822 3487 | jlacefi...@datastax.com On Tue, Jan 20, 2015 at 9:46 PM, Peter Lin wool...@gmail.com wrote: the thing is, CQL only handles some types of dynamic column use cases. There are plenty of examples on datastax.com that show how to do CQL-style dynamic columns. Based on what was described by Chetan, I don't feel CQL3 is a perfect fit for what he wants to do. To use CQL3, he'd have to change his approach. In my temporal database, I use both Thrift and CQL. They complement each other very nicely. I don't understand why people have to put down Thrift or pretend it supports 100% of the use cases. Lots of people started using Cassandra pre-CQL and had no problems using Thrift. Yes, you have to understand more and the learning curve is steeper, but taking time to learn the internals of Cassandra is a good thing. Using CQL3 lists or maps would force the query to load the entire collection, but that is by design.
To get the full power of the old style of dynamic columns, Thrift is a better fit. I hope CQL continues to improve so that it supports 100% of the existing use cases. On Tue, Jan 20, 2015 at 8:50 PM, Xu Zhongxing xu_zhong_x...@163.com wrote: I approximate dynamic columns with data_key and data_value columns. Is there a better way to get dynamic columns in CQL 3? At 2015-01-21 09:41:02, Peter Lin wool...@gmail.com wrote: I think that table example misses the point of chetan's functional requirement. He actually needs dynamic columns. On Tue, Jan 20, 2015 at 8:12 PM, Xu Zhongxing xu_zhong_x...@163.com wrote: Maybe this is the closest thing to dynamic columns in CQL 3. create table review ( product_id bigint, created_at timestamp, data_key text, data_tvalue text, data_ivalue int, primary key ((product_id, created_at), data_key) ); data_tvalue and data_ivalue are optional. At 2015-01-21 04:44:07, chetan verma chetanverm...@gmail.com wrote: Hi, Adding to my previous mail. For example: we have a column family named review (with some arbitrary data in maps). CREATE TABLE review( product_id bigint, created_at timestamp, data_int map<text, int>, data_text map<text, text>, PRIMARY KEY (product_id, created_at) ); Assume these 2 maps are what I use to store arbitrary data (i.e. data_int and data_text for int and text values). In the cassandra-cli output, within a partition it appears as clustering_key:data_int:map_key for the column name, with the map value as the cell value. Suppose I need to get just one such value: I couldn't do that with CQL3, but in Thrift it's possible. Any solution? On Wed, Jan 21, 2015 at 1:06 AM, chetan verma chetanverm...@gmail.com wrote: Hi, Most of the time I will be querying on product_id and created_at, but for analytics I need to query on almost every column. The multiple-collections idea is good, but the issue is that Cassandra reads a collection entirely; what if I need a slice of it, I mean the values for certain keys, which is possible with Thrift. Please suggest.
On Wed, Jan 21, 2015 at 12:36 AM, Jonathan Lacefield jlacefi...@datastax.com wrote: Hello, There are probably lots of options to this challenge. The more details around your use case that you can provide, the easier it will be for this group to offer advice. A few follow-up questions: - How will you query this data? - Do your queries require filtering on specific columns other than product_id and created_at, i.e. the dynamic columns? Depending on the answers to these questions, you have several options, of which here are a few: - Cassandra efficiently stores sparse data, so you could create columns and not populate them, without much of a penalty - Could use a clustering column to store a columns type and another col (potentially clustering) to store the value - i.e. CREATE TABLE foo (col1 int, attname text, attvalue text, col4...n,
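The slice Chetan misses on maps is exactly what Xu's clustered layout gives back: with data_key as a clustering column, individual keys can be selected instead of loading a whole collection. A sketch against the review table from this thread (the key values are illustrative):

```cql
-- Fetch only the requested "dynamic columns" of one partition, rather than
-- an entire map collection:
SELECT data_key, data_tvalue, data_ivalue
FROM review
WHERE product_id = 1001
  AND created_at = '2015-01-20 00:00:00'
  AND data_key IN ('color', 'size');
```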
row cache hit is costlier for partiton with large rows
Hi, With two different column families, when I do a read, a row cache hit is almost 15x costlier with larger partitions (10,000 rows per partition) than with a partition of only 100 rows. The difference between the two column families is that one has 100 rows per partition and the other 10,000 rows per partition. The schema for the two tables is: CREATE TABLE table1_row_cache ( user_id uuid, dept_id uuid, location_id text, locationmap_id uuid, PRIMARY KEY ((user_id, location_id), dept_id) ) CREATE TABLE table2_row_cache ( user_id uuid, dept_id uuid, location_id text, locationmap_id uuid, PRIMARY KEY ((user_id, dept_id), location_id) ) Here is the tracing: Row cache hit with column family table1_row_cache, 100 rows per partition: Preparing statement [SharedPool-Worker-2] | 2015-01-20 14:35:47.54 | x.x.x.x | 1023 Row cache hit [SharedPool-Worker-5] | 2015-01-20 14:35:47.542000 | x.x.x.x | 2426 Row cache hit with CF table2_row_cache, 10,000 rows per partition: Preparing statement [SharedPool-Worker-1] | 2015-01-20 16:02:51.696000 | x.x.x.x | 490 Row cache hit [SharedPool-Worker-2] | 2015-01-20 16:02:51.711000 | x.x.x.x | 15146 If in both cases the data is in memory, why isn't it the same? Can someone point out what's wrong here? Nitin Padalia
Re: keyspace not exists?
Thanks Rob, we'll keep this in mind for our learning journey. Jason On Wed, Jan 21, 2015 at 6:45 AM, Robert Coli rc...@eventbrite.com wrote: On Sun, Jan 18, 2015 at 8:55 PM, Jason Wee peich...@gmail.com wrote: two nodes running cassandra 2.1.2 and one running cassandra 2.1.1 For the record, this is an unsupported persistent configuration. You are only supposed to have split minor versions during an upgrade. I have no idea if it is causing the problem you are having. =Rob
cassandra-stress - confusing documentation
Hi all, I'm using cassandra-stress directly from apache-cassandra-2.1.2/tools/bin. The documentation I found at http://www.datastax.com/documentation/cassandra/2.1/cassandra/tools/toolsCStress_t.html is either too old or too advanced, and does not match what I use. In particular, I fail to use the -key populate=1..100 option as used in the two-node example from the link above. # On Node1: $ cassandra-stress write tries=20 n=100 cl=one -mode native cql3 -schema keyspace=Keyspace1 -key populate=1..100 -log file=~/node1_load.log -node $NODES # On Node2: $ cassandra-stress write tries=20 n=100 cl=one -mode native cql3 -schema keyspace=Keyspace1 -key populate=101..200 -log file=~/node2_load.log -node $NODES Can someone please direct me to the right doc, or to a valid example of using a populate range? Thanks, Tzach
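For what it's worth, the 2.1 rewrite of cassandra-stress renamed several options, which may be why the documented -key populate=... form is rejected. A rough 2.1-style equivalent of the Node1 command (an assumption worth verifying with `cassandra-stress help -pop` on your build):

```shell
# 2.1-style sketch; -pop seq=... is assumed to replace the old
# -key populate=... range option.
cassandra-stress write n=100 cl=one -mode native cql3 \
    -schema keyspace=Keyspace1 -pop seq=1..100 \
    -log file=~/node1_load.log -node $NODES
```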
Re: Re: Dynamic Columns
I've studied the source code and I don't believe that statement is true. I've chatted with several long-time users of Cassandra and there are things CQL3 doesn't support. Like I've said before, Thrift and CQL3 complement each other. I totally understand some committers don't want the overhead due to time and resource limitations. On more than one occasion, people have offered to help and work on Thrift, but were rejected. There are logs in JIRA. For the record, it's great that CQL was created to make life easier for new users. But here's the thing that annoys me. There are users that just want to save and query data, but there are people out there like me that are building tools for Cassandra. For tool builders, having an object API like Thrift is invaluable. If we look at relational databases, we see many of them have two separate APIs for that reason. Microsoft SQL Server has SQL and an object API. Having both makes it easier to build tools. It's a shame to ignore all the lessons RDBMSs can teach us and suffer from NIH syndrome. I've built several data modeling tools over the years, including ORMs. We built our own data modeling tool for the temporal database I built on Cassandra, so this isn't just some hypothetical complaint. This is from many years of first-hand experience. I understand my needs often don't and won't line up with what's in Cassandra's roadmap. But that's the great thing about open source. Should Thrift go away permanently, I'll just fork Cassandra and do my own thing. On Wed, Jan 21, 2015 at 8:53 AM, Sylvain Lebresne sylv...@datastax.com wrote: On Wed, Jan 21, 2015 at 3:46 AM, Peter Lin wool...@gmail.com wrote: I don't understand why people [...] pretend it supports 100% of the use cases. Have you considered the possibility that it's actually true and you're just wrong for lack of knowledge? -- Sylvain
Re: Count of column values
Hi, Sorry for the previous incomplete message. I am using a where clause as follows: select count(*) from trends where data1='abc' ALLOW FILTERING; How can I store this count output in another column? Can you help with any workaround? Thanks, Poonam. On Wed, Jan 21, 2015 at 7:46 PM, Poonam Ligade poonam.v.lig...@gmail.com wrote: Hi, I am a newbie to Cassandra. I have to find out the top 10 recent trends in my data. I have the schema as follows: create table trends( day int, data1 text, data2 map<int, decimal>, PRIMARY KEY (day, data1)); I have to take a count of duplicate values in data1 so that I can find the top 10 data1 trends. 1. I tried adding a counter column, but you can't use an order by clause on a counter column. 2. I tried using a where clause
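Since CQL has no ORDER BY on a count or counter, one common pattern (sketched here with an illustrative table, not from the thread) is to bump a counter per value at write time and do the final top-10 sort client-side over the single day partition:

```cql
CREATE TABLE trend_counts (
    day   int,
    data1 text,
    hits  counter,
    PRIMARY KEY (day, data1)
);

-- On every occurrence of a data1 value:
UPDATE trend_counts SET hits = hits + 1 WHERE day = 20150121 AND data1 = 'abc';

-- Client side: read the (small) day partition and sort by hits, keeping 10:
SELECT data1, hits FROM trend_counts WHERE day = 20150121;
```

This avoids ALLOW FILTERING entirely, at the cost of one counter update per write.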
Re: Re: Dynamic Columns
On Wed, Jan 21, 2015 at 3:46 AM, Peter Lin wool...@gmail.com wrote: I don't understand why people [...] pretend it supports 100% of the use cases. Have you considered the possibility that it's actually true and you're just wrong for lack of knowledge? -- Sylvain
Re: row cache hit is costlier for partiton with large rows
The row cache saves partition data off-heap, which means that every cache hit requires copying/deserializing the cached partition onto the heap, and the more rows per partition you cache, the longer that takes. This is why it's currently not a good idea to cache too many rows per partition (unless you know what you're doing). On Wed, Jan 21, 2015 at 1:15 PM, nitin padalia padalia.ni...@gmail.com wrote: Hi, With two different column families, when I do a read, a row cache hit is almost 15x costlier with larger partitions (10,000 rows per partition) than with a partition of only 100 rows. The difference between the two column families is that one has 100 rows per partition and the other 10,000 rows per partition. The schema for the two tables is: CREATE TABLE table1_row_cache ( user_id uuid, dept_id uuid, location_id text, locationmap_id uuid, PRIMARY KEY ((user_id, location_id), dept_id) ) CREATE TABLE table2_row_cache ( user_id uuid, dept_id uuid, location_id text, locationmap_id uuid, PRIMARY KEY ((user_id, dept_id), location_id) ) Here is the tracing: Row cache hit with column family table1_row_cache, 100 rows per partition: Preparing statement [SharedPool-Worker-2] | 2015-01-20 14:35:47.54 | x.x.x.x | 1023 Row cache hit [SharedPool-Worker-5] | 2015-01-20 14:35:47.542000 | x.x.x.x | 2426 Row cache hit with CF table2_row_cache, 10,000 rows per partition: Preparing statement [SharedPool-Worker-1] | 2015-01-20 16:02:51.696000 | x.x.x.x | 490 Row cache hit [SharedPool-Worker-2] | 2015-01-20 16:02:51.711000 | x.x.x.x | 15146 If in both cases the data is in memory, why isn't it the same? Can someone point out what's wrong here? Nitin Padalia
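If the cluster is on 2.1, one way to act on Sylvain's advice is to bound how much of each partition the row cache keeps, so a hit only deserializes a limited slice onto the heap (2.1 caching syntax; the version in use is an assumption):

```cql
ALTER TABLE table2_row_cache
  WITH caching = '{"keys":"ALL", "rows_per_partition":"100"}';
```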
Count of column values
Hi, I am a newbie to Cassandra. I have to find out the top 10 recent trends in my data. I have the schema as follows: create table trends( day int, data1 text, data2 map<int, decimal>, PRIMARY KEY (day, data1)); I have to take a count of duplicate values in data1 so that I can find the top 10 data1 trends. 1. I tried adding a counter column, but you can't use an order by clause on a counter column. 2. I tried using a where clause
Re: Re: Dynamic Columns
I've chatted with several long-time users of Cassandra and there are things CQL3 doesn't support. Would you care to elaborate then? Maybe a simple example of something (or multiple things, since you used the plural) in Thrift that cannot be supported in CQL? And please note that I'm *not* saying that every existing Thrift table can be seamlessly used from CQL: there are indeed a few cases where that's not the case. But that does not mean those cases cannot easily be built in CQL from scratch.