Re: How to know disk utilization by each row on a node

2015-01-21 Thread nitin padalia
Did you use cfstats and cfhistograms?
On Jan 22, 2015 12:37 AM, Edson Marquezani Filho 
edsonmarquez...@gmail.com wrote:

 Ok, nice tool, but I still can't see how much data each row occupies
 on the SSTable (or am I missing something?).

 Note: considering the SSTable format, where rows are strictly sequential
 and sorted, a feature like that doesn't seem very hard to implement
 anyway. Wouldn't it be possible to calculate it from the index files
 alone, without even needing to read the actual table?

 On Tue, Jan 20, 2015 at 5:05 PM, Jens Rantil jens.ran...@tink.se wrote:
  Hi,
 
  Datastax comes with sstablekeys, which does that. You could also use
  the sstable2json script to find keys.
 
  Cheers,
  Jens
 
 
 
  On Tue, Jan 20, 2015 at 2:53 PM, Edson Marquezani Filho
  edsonmarquez...@gmail.com wrote:
 
  Hello, everybody.
 
  Does anyone know a way to list, for an arbitrary column family, all
  the rows owned (including replicas) by a given node and the data size
  (real size or disk occupation) of each one of them on that node?
 
  I would like to do that because I have data on one of my nodes growing
  faster than the others, although rows (and replicas) seem evenly
  distributed across the cluster. So, I would like to verify if I have
  some specific rows growing too much.
 
  Thank you.
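
For readers landing on this thread, a minimal sketch of the tools mentioned above, assuming a stock Cassandra 2.x install (keyspace, table, and SSTable path are illustrative):

    # Per-table disk usage and mean/max partition size estimates
    nodetool cfstats mykeyspace.mytable

    # Distribution of partition sizes and cell counts per partition
    nodetool cfhistograms mykeyspace mytable

    # Dump the partition keys present in a single SSTable (hex-encoded)
    sstablekeys /var/lib/cassandra/data/mykeyspace/mytable/mykeyspace-mytable-ka-1-Data.db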
 
 



Re: How to know disk utilization by each row on a node

2015-01-21 Thread Edson Marquezani Filho
Ok, nice tool, but I still can't see how much data each row occupies
on the SSTable (or am I missing something?).

Note: considering the SSTable format, where rows are strictly sequential
and sorted, a feature like that doesn't seem very hard to implement
anyway. Wouldn't it be possible to calculate it from the index files
alone, without even needing to read the actual table?

On Tue, Jan 20, 2015 at 5:05 PM, Jens Rantil jens.ran...@tink.se wrote:
 Hi,

 Datastax comes with sstablekeys, which does that. You could also use
 the sstable2json script to find keys.

 Cheers,
 Jens



 On Tue, Jan 20, 2015 at 2:53 PM, Edson Marquezani Filho
 edsonmarquez...@gmail.com wrote:

 Hello, everybody.

 Does anyone know a way to list, for an arbitrary column family, all
 the rows owned (including replicas) by a given node and the data size
 (real size or disk occupation) of each one of them on that node?

 I would like to do that because I have data on one of my nodes growing
 faster than the others, although rows (and replicas) seem evenly
 distributed across the cluster. So, I would like to verify if I have
 some specific rows growing too much.

 Thank you.




Re: Compaction failing to trigger

2015-01-21 Thread Robert Coli
On Wed, Jan 21, 2015 at 10:10 AM, Flavien Charlon flavien.char...@gmail.com
 wrote:

 https://engineering.eventbrite.com/what-version-of-cassandra-should-i-run/



 This doesn't really answer my question, I asked whether this particular
 bug (which I can't find in JIRA) is planned to be fixed in 2.1.3, not
 whether 2.1.3 would be production ready.


No idea, but I answered as I did because I didn't recognize your name/email
and you were encountering problems with an IMO not-ready-for-production
version. Many people who are new to Cassandra and pre-or-close-to-production
might be better served by running a slightly older version and focusing on
the challenge of writing their app against a mostly-working distributed
database instead of troubleshooting Cassandra bugs.

tl;dr - Cassandra bugs in cutting-edge versions are likely best encountered
by experienced operators who can recognize them and respond, not by new
operators.

While we're on this topic, the version numbering is very misleading.
 Versions which are not recommended for production should be very explicitly
 labelled as such (beta, for example), and 2.1.0 should really be what you
 now call 2.1.6.


That's why I wrote the blog post. It is however important to note that I
speak in no official capacity for Apache Cassandra or Datastax.

The intent of the project is for x.y.0 to be production ready, and in
fairness they have recently added new QA processes which are likely to
drive the production-ready version down from x.y.6. They are only human,
however, and as human developers they are likely to have slightly different
(lower) standards for production readiness than the typical operator. I
wrote that blog post to help set operator-appropriate expectations, so
people are not disappointed with the overall stability of Cassandra.

I personally operate Cassandra slightly on the trailing edge, and as a
result only encounter a limited subset of the problems I assist people with
on the list and IRC.

=Rob


Re: Compaction failing to trigger

2015-01-21 Thread Flavien Charlon

 What version of Cassandra are you running?


2.1.2

Are they all live? Are there pending compactions, or exceptions regarding
 compactions in your logs?


Yes, they are all live according to cfstats. There are no pending
compactions or exceptions in the logs.

https://engineering.eventbrite.com/what-version-of-cassandra-should-i-run/


This doesn't really answer my question, I asked whether this particular bug
(which I can't find in JIRA) is planned to be fixed in 2.1.3, not whether
2.1.3 would be production ready.

While we're on this topic, the version numbering is very misleading.
Versions which are not recommended for production should be very explicitly
labelled as such (beta, for example), and 2.1.0 should really be what you
now call 2.1.6.

Setting 'cold_reads_to_omit' to 0 did the job for me


Thanks, I've tried it, and it works. This should probably be made the
default IMO.

Flavien
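
A minimal sketch of the workaround discussed above, assuming 2.1 and a
size-tiered table (keyspace/table names are illustrative; cold_reads_to_omit
is an STCS suboption in 2.1):

    ALTER TABLE mykeyspace.mytable
      WITH compaction = {'class': 'SizeTieredCompactionStrategy',
                         'cold_reads_to_omit': 0.0};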


On 20 January 2015 at 22:51, Eric Stevens migh...@gmail.com wrote:

 @Rob - he's probably referring to the thread titled "Reasons for nodes not
 compacting?" where Tyler speculates that the tables are falling below the
 cold read threshold for compaction.  He speculated it may be a bug.  At the
 same time in a different thread, Roland had a similar problem, and Tyler's
 proposed workaround seemed to work for him.

 On Tue, Jan 20, 2015 at 3:35 PM, Robert Coli rc...@eventbrite.com wrote:

 On Sun, Jan 18, 2015 at 6:06 PM, Flavien Charlon 
 flavien.char...@gmail.com wrote:

 It's set on all the tables, as I'm using the default for all the tables.
 But for that particular table there are 41 SSTables between 60MB and 85MB,
 it should only take 4 for the compaction to kick in.


 What version of Cassandra are you running?

 Are they all live? Are there pending compactions, or exceptions
 regarding compactions in your logs?


 As this is probably a bug, and going back in the mailing list archive it
 seems it's already been reported:


 This is a weird statement. Are you saying that you've found it in the
 mailing list archives? If so, why not paste the threads so those of us who
 might remember can refer to them?


- Will it be fixed in 2.1.3?


 https://engineering.eventbrite.com/what-version-of-cassandra-should-i-run/


 =Rob





Re: How do replica become out of sync

2015-01-21 Thread Flavien Charlon
Quite a few, see here: http://pastebin.com/SMnprHdp. In total about 3,000
ranges across the 3 nodes.

This is with vnodes disabled. It was at least an order of magnitude worse
when we had it enabled.

Flavien

On 20 January 2015 at 22:22, Robert Coli rc...@eventbrite.com wrote:

 On Mon, Jan 19, 2015 at 5:44 PM, Flavien Charlon 
 flavien.char...@gmail.com wrote:

 Thanks Andi. The reason I was asking is that even though my nodes have
 been 100% available and no write has been rejected, when running an
 incremental repair the logs still indicate that some ranges are out of
 sync (which then results in large amounts of compaction). How can this be
 possible?


 This is most likely, as you conjecture, due to slight differences between
 nodes at the time of Merkle Tree calculation.

 How many rows differ?

 =Rob




Re: Re: Dynamic Columns

2015-01-21 Thread Peter Lin
the example you provided does not work for my use case.

  CREATE TABLE t (
    key blob,
    my_static_column_1 int static,
    my_static_column_2 float static,
    my_static_column_3 blob static,
    dynamic_column_name blob,
    dynamic_column_value blob,
    PRIMARY KEY (key, dynamic_column_name)
  );

the dynamic column can't be part of the primary key. The temporal entity
key can be the default UUID, or the user can choose the field in their
object. Within our framework, we have the concept of temporal links between
one or more temporal entities. Polluting the primary key with the dynamic
column wouldn't work.

Please excuse the confusing RDB comparison. My point is that Cassandra's
dynamic column feature is the unique feature that makes it better than a
traditional RDB, or NewSQL like VoltDB, for building temporal databases.
With databases that require a static schema plus ALTER TABLE for managing
schema evolution, evolution is harder and results in downtime.

One of the challenges of data management over time is evolving the data
model and keeping queries simple. If a record is 5 years old, it probably
has a different schema than a record inserted this week. With temporal
databases, every update is an insert, so it's a little more complex than
"just use a blob." There's a whole level of complication with temporal
data, and CQL3 custom types aren't clear to me. I've read the CQL3
documentation on custom types several times and it is rather poor. It
gives me the impression there's still work needed to get custom types into
good shape.

With regard to the examples others have told me about, your advice is fair.
A few minutes with Google should turn up some blogs. The reason I bring these
things up isn't to put down CQL. It's because I care and want to help
improve Cassandra by sharing my experience. I consistently recommend new
users learn and understand both Thrift and CQL.



On Wed, Jan 21, 2015 at 11:45 AM, Sylvain Lebresne sylv...@datastax.com
wrote:

 On Wed, Jan 21, 2015 at 4:44 PM, Peter Lin wool...@gmail.com wrote:

 I don't remember other people's examples in detail due to my shitty
 memory, so I'd rather not misquote.


 Fair enough, but maybe you shouldn't use people's examples you don't
 remember as arguments then. Those examples might be wrong or outdated, and
 that kind of stuff creates confusion for everyone.



 In my case, I mix static and dynamic columns in a single column family
 with primitives and objects. The objects are temporal object graphs with a
 known type. Doing this type of stuff is basically transparent for me, since
 I'm using thrift and our data modeler generates helper classes. Our tooling
 seamlessly converts the bytes back to the target object. We have a few
 standard static columns related to temporal metadata. At any time, dynamic
 columns can be added and they can be primitives or objects.


 I don't see anything there that cannot be done with CQL. You can mix
 static and dynamic columns in CQL thanks to static columns. More precisely,
 you can do what you're describing with a table looking a bit like this:
   CREATE TABLE t (
     key blob,
     my_static_column_1 int static,
     my_static_column_2 float static,
     my_static_column_3 blob static,
     dynamic_column_name blob,
     dynamic_column_value blob,
     PRIMARY KEY (key, dynamic_column_name)
   );

 And your helper classes will serialize your objects as they probably do
 today (if you use a custom comparator, you can do that too). And let it be
 clear that I'm not pretending that doing it this way is tremendously
 simpler than thrift. But I'm saying that 1) it's possible and 2) while it's
 not meaningfully simpler than thrift, it's not really harder either (and
 in fact, it's actually less verbose with CQL than with raw thrift).



 For the record, doing this kind of stuff in a relational database sucks
 horribly.


 I don't know what that has to do with CQL, to be honest. If you're doing
 relational with CQL, you're doing it wrong. And please note that I'm not
 saying CQL is the perfect API for modeling temporal data. But I don't get
 how thrift, which is a very crude API, is a much better API at that than
 CQL (or, again, how it allows you to do things you can't with CQL).

 --
 Sylvain



Re: Re: Dynamic Columns

2015-01-21 Thread Sylvain Lebresne
On Wed, Jan 21, 2015 at 4:44 PM, Peter Lin wool...@gmail.com wrote:

 I don't remember other people's examples in detail due to my shitty
 memory, so I'd rather not misquote.


Fair enough, but maybe you shouldn't use people's examples you don't
remember as arguments then. Those examples might be wrong or outdated, and
that kind of stuff creates confusion for everyone.



 In my case, I mix static and dynamic columns in a single column family
 with primitives and objects. The objects are temporal object graphs with a
 known type. Doing this type of stuff is basically transparent for me, since
 I'm using thrift and our data modeler generates helper classes. Our tooling
 seamlessly converts the bytes back to the target object. We have a few
 standard static columns related to temporal metadata. At any time, dynamic
 columns can be added and they can be primitives or objects.


I don't see anything there that cannot be done with CQL. You can mix
static and dynamic columns in CQL thanks to static columns. More precisely,
you can do what you're describing with a table looking a bit like this:
  CREATE TABLE t (
    key blob,
    my_static_column_1 int static,
    my_static_column_2 float static,
    my_static_column_3 blob static,
    dynamic_column_name blob,
    dynamic_column_value blob,
    PRIMARY KEY (key, dynamic_column_name)
  );

And your helper classes will serialize your objects as they probably do
today (if you use a custom comparator, you can do that too). And let it be
clear that I'm not pretending that doing it this way is tremendously
simpler than thrift. But I'm saying that 1) it's possible and 2) while it's
not meaningfully simpler than thrift, it's not really harder either (and
in fact, it's actually less verbose with CQL than with raw thrift).
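
To make the sketch concrete, here is one hedged way such a table could be
written and read, using the standard CQL3 blob conversion functions (key and
values are illustrative; the client decodes dynamic_column_value itself, as
the helper classes mentioned above would):

    INSERT INTO t (key, dynamic_column_name, dynamic_column_value)
    VALUES (0x0a, textAsBlob('age'), intAsBlob(42));

    SELECT dynamic_column_value FROM t
    WHERE key = 0x0a AND dynamic_column_name = textAsBlob('age');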



 For the record, doing this kind of stuff in a relational database sucks
 horribly.


I don't know what that has to do with CQL, to be honest. If you're doing
relational with CQL, you're doing it wrong. And please note that I'm not
saying CQL is the perfect API for modeling temporal data. But I don't get
how thrift, which is a very crude API, is a much better API at that than CQL
(or, again, how it allows you to do things you can't with CQL).

--
Sylvain


Re: Re: Dynamic Columns

2015-01-21 Thread Peter Lin
I don't remember other people's examples in detail due to my shitty memory,
so I'd rather not misquote.

In my case, I mix static and dynamic columns in a single column family with
primitives and objects. The objects are temporal object graphs with a known
type. Doing this type of stuff is basically transparent for me, since I'm
using thrift and our data modeler generates helper classes. Our tooling
seamlessly converts the bytes back to the target object. We have a few
standard static columns related to temporal metadata. At any time, dynamic
columns can be added and they can be primitives or objects. The framework
we built uses CQL for basic queries and views the user defines.

We model the schema in a GUI modeler and the framework provides a query API
to access a specific version or versions of any record. The design borrows
heavily from temporal logic and active databases.

For the record, doing this kind of stuff in a relational database sucks
horribly. The reason I chose to build a temporal database on Cassandra is
that I've done it on Oracle/SQLServer in the past. Last year I submitted
a talk about our temporal database for the DataStax conference, but it was
rejected since there were too many submissions. I know Spotify also built a
temporal database on Cassandra, and they gave a talk on what they did.

peter


On Wed, Jan 21, 2015 at 10:13 AM, Sylvain Lebresne sylv...@datastax.com
wrote:


 I've chatted with several long time users of Cassandra and there's things
 CQL3 doesn't support.


  Would you care to elaborate then? Maybe a simple example of something (or
  multiple things, since you used the plural) in thrift that cannot be
  supported in CQL?
  And please note that I'm *not* saying that all existing thrift tables can
  be seamlessly used from CQL: there are indeed a few cases for which that's
  not the case. But that does not mean those cases cannot easily be done in
  CQL from scratch.



Re: get partition key from tombstone warnings?

2015-01-21 Thread Philip Thompson
There is an open ticket for this improvement at
https://issues.apache.org/jira/browse/CASSANDRA-8561

On Wed, Jan 21, 2015 at 4:55 PM, Ian Rose ianr...@fullstory.com wrote:

 When I see a warning like "Read 9 live and 5769 tombstoned cells in ..."
 etc., is there a way for me to see the partition key that this query was
 operating on?

 The description in the original JIRA ticket (
 https://issues.apache.org/jira/browse/CASSANDRA-6042) reads as though
 exposing this information was one of the original goals, but it isn't
 obvious to me in the logs...

 Cheers!
 - Ian




Re: Re: Dynamic Columns

2015-01-21 Thread Sylvain Lebresne
On Wed, Jan 21, 2015 at 6:19 PM, Peter Lin wool...@gmail.com wrote:

 the dynamic column can't be part of the primary key. The temporal entity
 key can be the default UUID, or the user can choose the field in their
 object. Within our framework, we have the concept of temporal links between
 one or more temporal entities. Polluting the primary key with the dynamic
 column wouldn't work.


Not totally sure I understand. Are you talking about the underlying storage
space used? If you are, we can discuss it (it's not too hard to remedy in
CQL; I was mainly trying to illustrate my point, not pretending this was a
drop-in solution for your use case), but it's more of a performance
discussion, and I think we've somewhat quit the realm of "there's things
CQL3 doesn't support."


 Please excuse the confusing RDB comparison. My point is that Cassandra's
 dynamic column feature is the unique feature that makes it better than a
 traditional RDB, or NewSQL like VoltDB, for building temporal databases.
 With databases that require a static schema plus ALTER TABLE for managing
 schema evolution, evolution is harder and results in downtime.


Here again you seem to imply that CQL doesn't support dynamic columns, or
has somewhat inferior support, but that's just not true.


 One of the challenges of data management over time is evolving the data
 model and keeping queries simple. If a record is 5 years old, it probably
 has a different schema than a record inserted this week. With temporal
 databases, every update is an insert, so it's a little more complex than
 "just use a blob." There's a whole level of complication with temporal
 data, and CQL3 custom types aren't clear to me. I've read the CQL3
 documentation on custom types several times and it is rather poor. It
 gives me the impression there's still work needed to get custom types into
 good shape.


I'm sorry, but that's a bit of hand waving. Custom types (and by that I mean
user-provided AbstractType implementations) work in CQL *exactly* like in
thrift: they are not in a better or worse shape than in thrift. And while
the documentation on CQL3 is indeed poor on this part, so is the thrift
documentation on the same subject (besides, I don't think your whole
point is about saying that documentation could be improved). Again, what
you can do in thrift, you can do in CQL.


 I consistently recommend new users learn and understand both Thrift and
 CQL.


I understand that you do this with the best of intentions, and don't take it
the wrong way, but it is my opinion that you are being counterproductive by
doing so, for 2 reasons:
1) you don't only recommend users learn both APIs; you justify that
advice by affirming that there is a whole family of important use cases
that thrift supports and CQL does not. Except that I contend that this
affirmation is technically incorrect, and so far I haven't seen much
example proving me wrong.
2) there is a wealth of evidence that trying to learn both thrift and CQL
confuses the hell out of new users. Which is btw not surprising: both APIs
present the same concepts in seemingly different ways (even though they
are the same concepts) and even have conflicting vocabulary, so it's
obviously confusing when you try to learn those concepts in the first
place. Trying to learn CQL when you know thrift well is fine, and why not
learn thrift once you know and understand CQL well, but learning both is
imo bad advice. It could maybe (maybe) be justified if what you say about
a whole family of use cases not being doable with CQL were true, but
it's not.

--
Sylvain





 On Wed, Jan 21, 2015 at 11:45 AM, Sylvain Lebresne sylv...@datastax.com
 wrote:

 On Wed, Jan 21, 2015 at 4:44 PM, Peter Lin wool...@gmail.com wrote:

 I don't remember other people's examples in detail due to my shitty
 memory, so I'd rather not misquote.


  Fair enough, but maybe you shouldn't use people's examples you don't
  remember as arguments then. Those examples might be wrong or outdated, and
  that kind of stuff creates confusion for everyone.



  In my case, I mix static and dynamic columns in a single column family
  with primitives and objects. The objects are temporal object graphs with a
  known type. Doing this type of stuff is basically transparent for me, since
  I'm using thrift and our data modeler generates helper classes. Our tooling
  seamlessly converts the bytes back to the target object. We have a few
  standard static columns related to temporal metadata. At any time, dynamic
  columns can be added and they can be primitives or objects.


  I don't see anything there that cannot be done with CQL. You can mix
  static and dynamic columns in CQL thanks to static columns. More precisely,
  you can do what you're describing with a table looking a bit like this:
    CREATE TABLE t (
      key blob,
      my_static_column_1 int static,
      my_static_column_2 float static,
      my_static_column_3 blob static,
      dynamic_column_name blob,
 

get partition key from tombstone warnings?

2015-01-21 Thread Ian Rose
When I see a warning like "Read 9 live and 5769 tombstoned cells in ..."
etc., is there a way for me to see the partition key that this query was
operating on?

The description in the original JIRA ticket (
https://issues.apache.org/jira/browse/CASSANDRA-6042) reads as though
exposing this information was one of the original goals, but it isn't
obvious to me in the logs...

Cheers!
- Ian


Re: Re: Dynamic Columns

2015-01-21 Thread Peter Lin
I apologize if I've offended you, but I clearly stated CQL3 supports
dynamic columns. How it supports dynamic columns is different. If I'm
reading you correctly, I believe we agree both thrift and CQL3 support
dynamic columns. Where we differ is that I feel the coverage for existing
thrift use cases isn't 100%. That may be right or wrong, but it is my
impression. I agree with you that CQL3 supports the majority of dynamic
column use cases, but in a slightly different way. There are cases like
mine which fit better in thrift.

Could I rip out all the stuff I did and replace it with CQL3 via a major
redesign? Yes, I could, but honestly I see some downsides to that
proposition.

1. For modeling tools like mine, an object API is a far better fit, in my
biased opinion.
2. Text-based languages like SQL and CQL could in theory provide similar
object safety, but it's so much work that most people don't bother. This is
from first-hand experience building 3 ORMs and using most of the open
source ORMs in the Java space. I've also used several ORMs in .NET and they
all suffer from this pain point. There's a reason why Microsoft created
LINQ.
3. The structure and syntax of SQL and all variations of SQL are not
ideally suited to complex data structures that are graphs. A temporal
entity is an object graph that may be shallow (3-8 levels) or deep (15+).
SQL is ideally suited to tables. CQL in this regard is more flexible and
supports collections, but it's still not ideal for things like insurance
policies. Look at the ACORD standard for property insurance if you want to
get a better understanding. For example, a temporal record using an ORM
could result in 500 rows of data across a dozen tables for a small entity,
up to 50K+ rows for a large entity. The mailing list isn't the right place
to go into the theory and practice of temporal databases, but a lot of the
design choices I made are based on formal logic.



On Wed, Jan 21, 2015 at 4:06 PM, Sylvain Lebresne sylv...@datastax.com
wrote:

 On Wed, Jan 21, 2015 at 6:19 PM, Peter Lin wool...@gmail.com wrote:

 the dynamic column can't be part of the primary key. The temporal entity
 key can be the default UUID, or the user can choose the field in their
 object. Within our framework, we have the concept of temporal links between
 one or more temporal entities. Polluting the primary key with the dynamic
 column wouldn't work.


 Not totally sure I understand. Are you talking about the underlying
 storage space used? If you are, we can discuss it (it's not too hard to
 remedy in CQL; I was mainly trying to illustrate my point, not pretending
 this was a drop-in solution for your use case), but it's more of a
 performance discussion, and I think we've somewhat quit the realm of
 "there's things CQL3 doesn't support."


 Please excuse the confusing RDB comparison. My point is that Cassandra's
 dynamic column feature is the unique feature that makes it better than a
 traditional RDB, or NewSQL like VoltDB, for building temporal databases.
 With databases that require a static schema plus ALTER TABLE for managing
 schema evolution, evolution is harder and results in downtime.


 Here again you seem to imply that CQL doesn't support dynamic columns, or
 has somewhat inferior support, but that's just not true.


 One of the challenges of data management over time is evolving the data
 model and keeping queries simple. If a record is 5 years old, it probably
 has a different schema than a record inserted this week. With temporal
 databases, every update is an insert, so it's a little more complex than
 "just use a blob." There's a whole level of complication with temporal
 data, and CQL3 custom types aren't clear to me. I've read the CQL3
 documentation on custom types several times and it is rather poor. It
 gives me the impression there's still work needed to get custom types into
 good shape.


 I'm sorry, but that's a bit of hand waving. Custom types (and by that I
 mean user-provided AbstractType implementations) work in CQL *exactly*
 like in thrift: they are not in a better or worse shape than in thrift. And
 while the documentation on CQL3 is indeed poor on this part, so is the
 thrift documentation on the same subject (besides, I don't think your
 whole point is about saying that documentation could be improved). Again,
 what you can do in thrift, you can do in CQL.


Honestly, I haven't tried to use CQL3 user-provided types. I read the
specification several times and had a ton of questions, along with several
other people who were trying to understand what it meant. If you want people
to use it, the documentation needs to improve. I did give a good-faith
effort and spent a week trying to understand what the spec is trying to say,
but it only resulted in more questions. So yes, I am hand waving, because it
left me frustrated. Having been part of the Apache community for many years,
writing great docs is hard and most of us hate doing it. Just to be clear,
I'm not blaming anyone for poor docs. I'm just 

Re: Re: Dynamic Columns

2015-01-21 Thread Robert Coli
On Wed, Jan 21, 2015 at 9:19 AM, Peter Lin wool...@gmail.com wrote:


 I consistently recommend new users learn and understand both Thrift and
 CQL.


FWIW, I consider this a disservice to new users. New users should use CQL,
and not deploy against a deprecated-in-all-but-name API. Understanding
non-CQL *storage* might be valuable; understanding the Thrift interface to
storage is anti-valuable.

Despite the dissembling public statements regarding Thrift "not going
anywhere," it is obvious to me that there is a reason no other databases
exist with two non-pluggable and incompatible APIs. The pain of maintaining
these two APIs will eventually become not worth the backwards
compatibility. At that time it will be deprecated and then shortly
thereafter removed; I expect this to happen by EOY 2018 at the latest. [1]

=Rob
[1] If anyone strongly disagrees, I am taking $20 cash bets, with any
proceeds donated to the Apache Foundation.


Re: Re: Dynamic Columns

2015-01-21 Thread Peter Lin
everyone is different. I also recommend users take time to understand
every tool they use, as much as time allows. We don't always have the
luxury of time, but I see no point recommending laziness.

I'm probably insane, since I also spend time reading papers on CRDT, paxos,
query compilers, machine learning and other topics I find fun.

on the topic of multiple incompatible APIs, I recommend you look at
SQLServer and Sybase. Most of the legacy RDBMSs have multiple incompatible
APIs. Though in some cases, it is/was unavoidable.

On Wed, Jan 21, 2015 at 4:47 PM, Robert Coli rc...@eventbrite.com wrote:

 On Wed, Jan 21, 2015 at 9:19 AM, Peter Lin wool...@gmail.com wrote:


 I consistently recommend new users learn and understand both Thrift and
 CQL.


 FWIW, I consider this a disservice to new users. New users should use CQL,
 and not deploy against a deprecated-in-all-but-name API. Understanding
 non-CQL *storage* might be valuable; understanding the Thrift interface to
 storage is anti-valuable.

 Despite the dissembling public statements regarding Thrift "not going
 anywhere," it is obvious to me that there is a reason no other databases
 exist with two non-pluggable and incompatible APIs. The pain of maintaining
 these two APIs will eventually become not worth the backwards
 compatibility. At that time it will be deprecated and then shortly
 thereafter removed; I expect this to happen by EOY 2018 at the latest. [1]

 =Rob
 [1] If anyone strongly disagrees, I am taking $20 cash bets, with any
 proceeds donated to the Apache Foundation.




Re: Re: Dynamic Columns

2015-01-21 Thread Peter Lin
I've written my fair share of crappy code, which became legacy. Then I or
someone else was left with supporting it and something newer. Isn't that
the nature of software development?

I forget who said this quote first, but I'm gonna borrow it: "only pretty
code is code that is in your head; once it's written, it becomes crap." I
tell my son this all the time. When we start a project we have no clue what
we should have known, so we make a butt load of mistakes. If we're lucky,
by the third or fourth version it's not so smelly, but in the meantime we
have to keep supporting the stuff. Not because we want to, but because
we're the ones that put the users through it. At least that's how I see it.

Having said that, at some point the really old stuff should be deprecated
and cleaned out. It totally makes sense to remove thrift at some point. I
don't know when that is, but every piece of software eventually dies or is
abandoned. Except for COBOL. That thing will be around 200 years from now.



On Wed, Jan 21, 2015 at 6:57 PM, Robert Coli rc...@eventbrite.com wrote:

 On Wed, Jan 21, 2015 at 2:09 PM, Peter Lin wool...@gmail.com wrote:

 on the topic of multiple incompatible APIs, I recommend you look at
 SQLServer and Sybase. Most of the legacy RDBMSs have multiple incompatible
 APIs. Though in some cases, it is/was unavoidable.


 My bet is that the small development team responsible for Cassandra does
 not have anything like the number of contractual obligations that
 commercial databases from the 1980s had. In other words, I believe having
 two persistent, non-pluggable (this attribute probably excludes various
 legacy APIs?) APIs is far more avoidable in the Cassandra case than in
 the historic cases you cite. I could certainly be wrong... people who
 disagree with my assessment now have a way to make me pay for my wrongness
 by making me donate $20 to the Apache Foundation on Jan 1, 2019. [1] :D

 =Rob
 [1] Project committers/others with material ability (Datastax...) to
 affect outcome ineligible.




Re: Re: Dynamic Columns

2015-01-21 Thread Robert Coli
On Wed, Jan 21, 2015 at 2:09 PM, Peter Lin wool...@gmail.com wrote:

 on the topic of multiple incompatible APIs, I recommend you look at
 SQLServer and Sybase. Most of the legacy RDBMSs have multiple incompatible
 APIs. Though in some cases, it is/was unavoidable.


My bet is that the small development team responsible for Cassandra does
not have anything like the number of contractual obligations that
commercial databases from the 1980s had. In other words, I believe having
two persistent, non-pluggable (this attribute probably excludes various
legacy APIs?) APIs is far more avoidable in the Cassandra case than in
the historic cases you cite. I could certainly be wrong... people who
disagree with my assessment now have a way to make me pay for my wrongness
by making me donate $20 to the Apache Foundation on Jan 1, 2019. [1] :D

=Rob
[1] Project committers/others with material ability (Datastax...) to affect
outcome ineligible.


Re: Re: Dynamic Columns

2015-01-21 Thread Jack Krupansky
Peter,

At least from your description, the proposed use of the clustering column
name seems at first blush to fully fit the bill. The point is not that the
resulting clustered primary key is used to reference an object, but that a
SELECT on the partition key references the entire object, which will be a
sequence of CQL3 rows in a partition, and then the clustering column key is
added when you wish to access that specific aspect of the object. What's
missing? Again, just store the partition key to reference the full object -
no pollution required!

And please note that any number of clustering columns can be specified, so
more structured dynamic columns can be supported. For example, you could
have a timestamp as a separate clustering column to maintain temporal state
of the database. The partition key can also be structured from multiple
columns as a composite partition key as well.

As for all these static columns, consider them optional and merely an
optimization. If you wish to have a 100% opaque object model, you wouldn't
have any static columns, and the only non-primary-key column would be the
blob value field. Every object attribute would be specified using another
clustering column name and blob value. Presto: everything you need for a
pure, opaque, fully-generalized object management system - all with just
CQL3. Maybe we should include such an example in the doc and with the
project to more strongly emphasize this capability to fully model
arbitrarily complex object structures - including temporal structures.
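
As a hedged illustration of that fully-opaque variant (all names invented
for the example; one partition per object, clustered by attribute name plus
a timestamp for temporal state):

    CREATE TABLE objects (
      obj_id uuid,
      attr_name text,
      valid_from timestamp,
      attr_value blob,
      PRIMARY KEY (obj_id, attr_name, valid_from)
    );

A SELECT on obj_id alone returns the whole object; adding attr_name (and
optionally valid_from) narrows the read to one attribute, or to one
attribute's state at a point in time.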

Anything else missing?

As a general proposition, you can use the term "clustering column" in CQL3
wherever you might have used "dynamic column" in Thrift. The point in CQL3
is not to eliminate a useful feature, dynamic columns, but to repackage the
feature to make a lot more sense for the vast majority of use cases. Maybe
there are some cases that don't fit exactly as well as desired, but feel
free to specifically identify such cases so that we can elaborate on how we
think they are covered, or at least covered well enough for most users.


-- Jack Krupansky

On Wed, Jan 21, 2015 at 12:19 PM, Peter Lin wool...@gmail.com wrote:


 the example you provided does not work for my use case.

   CREATE TABLE t (
     key blob,
     my_static_column_1 int static,
     my_static_column_2 float static,
     my_static_column_3 blob static,
     dynamic_column_name blob,
     dynamic_column_value blob,
     PRIMARY KEY (key, dynamic_column_name)
   );

 the dynamic column can't be part of the primary key. The temporal entity
 key can be the default UUID, or the user can choose the field in their
 object. Within our framework, we have the concept of temporal links between
 one or more temporal entities. Polluting the primary key with the dynamic
 column wouldn't work.

 Please excuse the confusing RDB comparison. My point is that Cassandra's
 dynamic column feature is the unique feature that makes it better than a
 traditional RDB, or NewSQL like VoltDB, for building temporal databases.
 With databases that require a static schema plus ALTER TABLE for managing
 schema evolution, evolution is harder and results in downtime.

 One of the challenges of data management over time is evolving the data
 model and keeping queries simple. If a record is 5 years old, it probably
 has a different schema than a record inserted this week. With temporal
 databases, every update is an insert, so it's a little more complex than
 "just use a blob." There's a whole level of complication with temporal
 data, and CQL3 custom types aren't clear to me. I've read the CQL3
 documentation on custom types several times and it is rather poor. It
 gives me the impression there's still work needed to get custom types into
 good shape.

 With regard to the examples others have told me about, your advice is fair.
 A few minutes with Google should turn up some blogs. The reason I bring
 these things up isn't to put down CQL. It's because I care and want to help
 improve Cassandra by sharing my experience. I consistently recommend new
 users learn and understand both Thrift and CQL.



 On Wed, Jan 21, 2015 at 11:45 AM, Sylvain Lebresne sylv...@datastax.com
 wrote:

 On Wed, Jan 21, 2015 at 4:44 PM, Peter Lin wool...@gmail.com wrote:

 I don't remember other people's examples in detail due to my shitty
 memory, so I'd rather not misquote.


  Fair enough, but maybe you shouldn't use people's examples you don't
  remember as arguments then. Those examples might be wrong or outdated, and
  that kind of stuff creates confusion for everyone.



  In my case, I mix static and dynamic columns in a single column family
  with primitives and objects. The objects are temporal object graphs with a
  known type. Doing this type of stuff is basically transparent for me, since
  I'm using thrift and our data modeler generates helper classes. Our tooling
  seamlessly converts the bytes back to the target object. We have a few
 standard static columns related to 

Re: Is there a way to add a new node to a cluster but not sync old data?

2015-01-21 Thread Yatong Zhang
Thanks for the reply. The bootstrap of the new node put a heavy burden on
the whole cluster and I don't know why. So that's the issue I actually want
to fix.

On Mon, Jan 12, 2015 at 6:08 AM, Eric Stevens migh...@gmail.com wrote:

 Yes, but it won't do what I suspect you're hoping for.  If you disable
 auto_bootstrap in cassandra.yaml the node will join the cluster and will
 not stream any old data from existing nodes.

 The cluster will now be in an inconsistent state.  If you bring enough
 nodes online this way to violate your read consistency level (eg RF=3,
 CL=Quorum, if you bring on 2 nodes this way), some of your queries will be
 missing data that they ought to have returned.

 There is no way to bring a new node online and have it be responsible just
 for new data, and have no responsibility for old data.  It *will* be
 responsible for old data, it just won't *know* about the old data it
 should be responsible for.  Executing a repair will fix this, but only
 because the existing nodes will stream all the missing data to the new
 node.  This will create more pressure on your cluster than just normal
 bootstrapping would have.

 I can't think of any reason you'd want to do that unless you needed to
 grow your cluster really quickly, and were ok with corrupting your old data.

 On Sat, Jan 10, 2015 at 12:39 AM, Yatong Zhang bluefl...@gmail.com
 wrote:

 Hi there,

 I am using C* 2.0.10 and I was trying to add a new node to a
 cluster (actually replacing a dead node). But after adding the new node,
 some other nodes in the cluster had a very high workload and it affected
 the whole performance of the cluster.
 So I am wondering: is there a way to add a new node such that this node
 only serves new data?
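
(Since the original goal was replacing a dead node: the usual mechanism for
that in C* 2.0, rather than disabling bootstrap, is to start the replacement
with the replace_address system property. A hedged sketch, with an
illustrative address; cassandra-env.sh is one common place to set it:

    # on the new node; use the dead node's IP address
    JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=10.0.0.12"

The new node then streams the dead node's data and takes over its tokens.)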





Fwd: ReadTimeoutException in Cassandra 2.0.11

2015-01-21 Thread Neha Trivedi
Hello All,
I am trying to process a 200MB file and I am getting the following error.
We are using apache-cassandra-2.0.3.jar.
com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra timeout
during read query at consistency ONE (1 responses were required but only 0
replica responded)

1. Is it due to memory?
2. Is it related to the driver?

Initially, when I was trying a 15MB file, it threw the same exception, but
after that it started working.


thanks
regards
neha


Re: Is there a way to add a new node to a cluster but not sync old data?

2015-01-21 Thread Eric Stevens
Yes, bootstrapping a new node will cause read loads on your existing nodes
- it is becoming the owner and replica of a whole new set of existing
data.  To do that it needs to know what data it's now responsible for, and
that's what bootstrapping is for.

If you're at the point where bootstrapping a new node is placing a
too-heavy burden on your existing nodes, you may be dangerously close to or
even past the tipping point where you ought to have already grown your
cluster.  You need to grow your cluster as soon as possible, and chances
are you're close to no longer being able to keep up with compaction (see
nodetool compactionstats, make sure pending tasks is < 5, preferably 0 or
1).  Once you're falling behind on compaction, it becomes difficult to
successfully bootstrap new nodes, and you're in a very tough spot.
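
A quick sketch of the check described above (output shape illustrative):

    $ nodetool compactionstats
    pending tasks: 0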


On Wed, Jan 21, 2015 at 7:43 PM, Yatong Zhang bluefl...@gmail.com wrote:

 Thanks for the reply. The bootstrap of the new node put a heavy burden on
 the whole cluster and I don't know why. So that's the issue I actually want
 to fix.

 On Mon, Jan 12, 2015 at 6:08 AM, Eric Stevens migh...@gmail.com wrote:

 Yes, but it won't do what I suspect you're hoping for.  If you disable
 auto_bootstrap in cassandra.yaml the node will join the cluster and will
 not stream any old data from existing nodes.

 The cluster will now be in an inconsistent state.  If you bring enough
 nodes online this way to violate your read consistency level (eg RF=3,
 CL=Quorum, if you bring on 2 nodes this way), some of your queries will be
 missing data that they ought to have returned.

 There is no way to bring a new node online and have it be responsible
 just for new data, and have no responsibility for old data.  It *will* be
 responsible for old data, it just won't *know* about the old data it
 should be responsible for.  Executing a repair will fix this, but only
 because the existing nodes will stream all the missing data to the new
 node.  This will create more pressure on your cluster than just normal
 bootstrapping would have.

 I can't think of any reason you'd want to do that unless you needed to
 grow your cluster really quickly, and were ok with corrupting your old data.

 On Sat, Jan 10, 2015 at 12:39 AM, Yatong Zhang bluefl...@gmail.com
 wrote:

 Hi there,

  I am using C* 2.0.10 and I was trying to add a new node to a
  cluster (actually replacing a dead node). But after adding the new node,
  some other nodes in the cluster had a very high workload and it affected
  the whole performance of the cluster.
  So I am wondering: is there a way to add a new node such that this node
  only serves new data?






Re: Is there a way to add a new node to a cluster but not sync old data?

2015-01-21 Thread Yatong Zhang
Yes, my cluster is almost full and there are lots of pending tasks. You've
helped me a lot; thank you, Eric~

On Thu, Jan 22, 2015 at 11:59 AM, Eric Stevens migh...@gmail.com wrote:

 Yes, bootstrapping a new node will cause read loads on your existing nodes
 - it is becoming the owner and replica of a whole new set of existing
 data.  To do that it needs to know what data it's now responsible for, and
 that's what bootstrapping is for.

 If you're at the point where bootstrapping a new node is placing a
 too-heavy burden on your existing nodes, you may be dangerously close to or
 even past the tipping point where you ought to have already grown your
 cluster.  You need to grow your cluster as soon as possible, and chances
 are you're close to no longer being able to keep up with compaction (see
 nodetool compactionstats, make sure pending tasks is < 5, preferably 0 or
 1).  Once you're falling behind on compaction, it becomes difficult to
 successfully bootstrap new nodes, and you're in a very tough spot.


 On Wed, Jan 21, 2015 at 7:43 PM, Yatong Zhang bluefl...@gmail.com wrote:

 Thanks for the reply. The bootstrap of the new node put a heavy burden on
 the whole cluster and I don't know why. So that's the issue I actually want
 to fix.

 On Mon, Jan 12, 2015 at 6:08 AM, Eric Stevens migh...@gmail.com wrote:

 Yes, but it won't do what I suspect you're hoping for.  If you disable
 auto_bootstrap in cassandra.yaml the node will join the cluster and will
 not stream any old data from existing nodes.

 The cluster will now be in an inconsistent state.  If you bring enough
 nodes online this way to violate your read consistency level (eg RF=3,
 CL=Quorum, if you bring on 2 nodes this way), some of your queries will be
 missing data that they ought to have returned.

 There is no way to bring a new node online and have it be responsible
 just for new data, and have no responsibility for old data.  It *will* be
 responsible for old data, it just won't *know* about the old data it
 should be responsible for.  Executing a repair will fix this, but only
 because the existing nodes will stream all the missing data to the new
 node.  This will create more pressure on your cluster than just normal
 bootstrapping would have.

 I can't think of any reason you'd want to do that unless you needed to
 grow your cluster really quickly, and were ok with corrupting your old data.

 On Sat, Jan 10, 2015 at 12:39 AM, Yatong Zhang bluefl...@gmail.com
 wrote:

 Hi there,

 I am using C* 2.0.10 and I was trying to add a new node to a
 cluster (actually replacing a dead node). But after adding the new node,
 some other nodes in the cluster had a very high workload and it affected
 the whole performance of the cluster.
 So I am wondering: is there a way to add a new node such that this node
 only serves new data?







Re: get partition key from tombstone warnings?

2015-01-21 Thread Ian Rose
Ah, thanks for the pointer, Philip.  Is there any kind of formal way to
"vote up" issues?  I'm assuming that adding a comment of "+1" or the like
is more likely to be *counter*productive.

- Ian


On Wed, Jan 21, 2015 at 5:02 PM, Philip Thompson 
philip.thomp...@datastax.com wrote:

 There is an open ticket for this improvement at
 https://issues.apache.org/jira/browse/CASSANDRA-8561

 On Wed, Jan 21, 2015 at 4:55 PM, Ian Rose ianr...@fullstory.com wrote:

 When I see a warning like "Read 9 live and 5769 tombstoned cells in ..."
 etc., is there a way for me to see the partition key that this query was
 operating on?

 The description in the original JIRA ticket (
 https://issues.apache.org/jira/browse/CASSANDRA-6042) reads as though
 exposing this information was one of the original goals, but it isn't
 obvious to me in the logs...

 Cheers!
 - Ian





Re: Versioning in cassandra while indexing ?

2015-01-21 Thread Kai Wang
depending on your data model, a static column might be useful.
https://issues.apache.org/jira/plugins/servlet/mobile#issue/CASSANDRA-6561
On Jan 21, 2015 2:56 AM, Pandian R pandian4m...@gmail.com wrote:

 Hi,

 I just wanted to know if there is any kind of versioning system in
 Cassandra while indexing new data (like the one we have for ElasticSearch,
 for example).

 For example, I have a series of payloads, each coming with an id and an
 'updatedAt' timestamp. I just want to maintain the latest state of any
 payload for all the ids, i.e. index the data only if the current payload
 has a greater 'updatedAt' than the previously stored timestamp. I can do
 this with one additional self-lookup, but is there a way to achieve this
 without the overhead of an additional lookup?

 Thanks !

 --
 Regards,
 Pandian



Re: Versioning in cassandra while indexing ?

2015-01-21 Thread graham sanderson
I believe you can use “USING TIMESTAMP XXX” with your inserts which will set 
the actual cell write times to the timestamp you provide. Then at least on read 
you’ll get the “latest” value… you may or may not incur an actual write of the 
old data to disk, but either way it’ll get cleaned up for you.
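
A hedged example of the pattern graham describes (schema and values
invented; the timestamp is in microseconds since the epoch, so it can be
derived directly from 'updatedAt'):

    INSERT INTO payload_state (id, payload)
    VALUES (42, 'v2')
    USING TIMESTAMP 1421802000000000;

A later insert carrying an older 'updatedAt' (i.e. a smaller USING
TIMESTAMP) still gets written, but reads always resolve to the cell with the
highest write timestamp, so the latest state wins without a read-before-write.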

 On Jan 21, 2015, at 1:54 AM, Pandian R pandian4m...@gmail.com wrote:
 
 Hi,
 
  I just wanted to know if there is any kind of versioning system in Cassandra
  while indexing new data (like the one we have for ElasticSearch, for example).

  For example, I have a series of payloads, each coming with an id and an
  'updatedAt' timestamp. I just want to maintain the latest state of any
  payload for all the ids, i.e. index the data only if the current payload has
  a greater 'updatedAt' than the previously stored timestamp. I can do this
  with one additional self-lookup, but is there a way to achieve this without
  the overhead of an additional lookup?
 
 Thanks !
 
 -- 
 Regards,
 Pandian





Re: Versioning in cassandra while indexing ?

2015-01-21 Thread Pandian R
Awesome. Thanks a lot Graham. Will use the clock timestamp for versioning :)

On Wed, Jan 21, 2015 at 2:02 PM, graham sanderson gra...@vast.com wrote:

 I believe you can use “USING TIMESTAMP XXX” with your inserts which will
 set the actual cell write times to the timestamp you provide. Then at least
 on read you’ll get the “latest” value… you may or may not incur an actual
 write of the old data to disk, but either way it’ll get cleaned up for you.

  On Jan 21, 2015, at 1:54 AM, Pandian R pandian4m...@gmail.com wrote:
 
  Hi,
 
  I just wanted to know if there is any kind of versioning system in
 Cassandra while indexing new data (like the one we have for ElasticSearch,
 for example).

  For example, I have a series of payloads, each coming with an id and an
 'updatedAt' timestamp. I just want to maintain the latest state of any
 payload for all the ids, i.e. index the data only if the current payload
 has a greater 'updatedAt' than the previously stored timestamp. I can do
 this with one additional self-lookup, but is there a way to achieve this
 without the overhead of an additional lookup?
 
  Thanks !
 
  --
  Regards,
  Pandian




-- 
Regards,
Pandian


Re: Re: Dynamic Columns

2015-01-21 Thread Jonathan Lacefield
Hello,

  Peter highlighted the tradeoff between Thrift and CQL3 nicely in this
case, i.e. requiring a different design approach for this solution.
Collections do not sound like a good fit for your current challenge, but is
there a different way to design/solve your challenge using CQL techniques?

  It is recommended to leverage CQL for new projects as this is the
direction that Cassandra is heading and where the majority of effort is
being applied from a development perspective.

  Sounds like you have a decision to make.  Leverage Thrift and the Dynamic
Column approach to solving this problem.  Or, rethink the design approach
and leverage CQL.

  Please let the mailing list know the direction you choose.

Jonathan

Jonathan Lacefield

Solution Architect | (404) 822 3487 | jlacefi...@datastax.com

On Tue, Jan 20, 2015 at 9:46 PM, Peter Lin wool...@gmail.com wrote:


 the thing is, CQL only handles some types of dynamic column use cases.
 There are plenty of examples on datastax.com that show how to do CQL-style
 dynamic columns.

 based on what was described by Chetan, I don't feel CQL3 is a perfect fit
 for what he wants to do. To use CQL3, he'd have to change his approach.

 In my temporal database, I use both Thrift and CQL. They complement each
 other very nicely. I don't understand why people have to put down Thrift or
 pretend it supports 100% of the use cases. Lots of people started using
 Cassandra pre-CQL and had no problems using thrift. Yes, you have to
 understand more and the learning curve is steeper, but taking time to learn
 the internals of Cassandra is a good thing.

 Using CQL3 lists or maps would force the query to load the entire
 collection, but that is by design. To get the full power of the old style
 of dynamic columns, thrift is a better fit. I hope CQL continues to improve
 so that it supports 100% of the existing use cases.



 On Tue, Jan 20, 2015 at 8:50 PM, Xu Zhongxing xu_zhong_x...@163.com
 wrote:

 I approximate dynamic columns by data_key and data_value columns.
 Is there a better way to get dynamic columns in CQL 3?

 At 2015-01-21 09:41:02, Peter Lin wool...@gmail.com wrote:


 I think that table example misses the point of chetan's functional
 requirement. he actually needs dynamic columns.

 On Tue, Jan 20, 2015 at 8:12 PM, Xu Zhongxing xu_zhong_x...@163.com
 wrote:

 Maybe this is the closest thing to dynamic columns in CQL 3.

 create table review (
 product_id bigint,
 created_at timestamp,
 data_key text,
 data_tvalue text,
 data_ivalue int,
 primary key ((product_id, created_at), data_key)
 );

 data_tvalue and data_ivalue are optional.

 At 2015-01-21 04:44:07, chetan verma chetanverm...@gmail.com wrote:

 Hi,

 Adding to previous mail. For example: We have a column family named
 review (with some arbitrary data in map).

 CREATE TABLE review(
 product_id bigint,
 created_at timestamp,
 data_int map<text, int>,
 data_text map<text, text>,
 PRIMARY KEY (product_id, created_at)
 );

 Assume that I use these 2 maps to store arbitrary data (i.e. data_int
 and data_text for int and text values).
 When we see the output in cassandra-cli, it appears within a partition as
 clustering_key:data_int:map_key as the column name, with the map value as
 the value. Suppose I need to get just this one value; I couldn't do that
 with CQL3, but in thrift it's possible. Any solution?
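
A hedged illustration of the asymmetry being described, against the review
table above (values invented): individual entries of a CQL3 collection can
be written, but a SELECT returns the whole collection.

    UPDATE review SET data_int['views'] = 42
      WHERE product_id = 10 AND created_at = '2015-01-20';

    -- no per-key read; this fetches the entire map
    SELECT data_int FROM review
      WHERE product_id = 10 AND created_at = '2015-01-20';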

 On Wed, Jan 21, 2015 at 1:06 AM, chetan verma chetanverm...@gmail.com
 wrote:

 Hi,

 Most of the time I will be querying on product_id and created_at, but
 for analytics I need to query on almost every column.
 The multiple collections idea is good, but the problem is that Cassandra
 reads a collection entirely; what if I need a slice of it, i.e. the
 columns for certain keys, which is possible with thrift? Please suggest.

 On Wed, Jan 21, 2015 at 12:36 AM, Jonathan Lacefield 
 jlacefi...@datastax.com wrote:

 Hello,

 There are probably lots of options to this challenge.  The more
 details around your use case that you can provide, the easier it will be
 for this group to offer advice.

 A few follow-up questions:
   - How will you query this data?
   - Do your queries require filtering on specific columns other than
 product_id and created_at, i.e. the dynamic columns?

 Depending on the answers to these questions, you have several options,
 of which here are a few:

- Cassandra efficiently stores sparse data, so you could create
columns and not populate them, without much of a penalty
   - Could use a clustering column to store a column's type and
   another col (potentially clustering) to store the value
   - i.e. CREATE TABLE foo (col1 int, attname text, attvalue text,
   col4...n, 

row cache hit is costlier for partition with large rows

2015-01-21 Thread nitin padalia
Hi,

With two different column families, when I do a read, a row cache hit is
almost 15x costlier with larger partitions (10,000 rows per partition) in
comparison to a partition with only 100 rows.

The difference between the two column families is that one has 100 rows per
partition and the other 10,000 rows per partition. The schema for the two
tables is:
CREATE TABLE table1_row_cache (
  user_id uuid,
  dept_id uuid,
  location_id text,
  locationmap_id uuid,
  PRIMARY KEY ((user_id, location_id), dept_id)
)

CREATE TABLE table2_row_cache (
  user_id uuid,
  dept_id uuid,
  location_id text,
  locationmap_id uuid,
  PRIMARY KEY ((user_id, dept_id), location_id)
)

Here is the tracing:

Row cache hit with column family table1_row_cache, 100 rows per partition:
Preparing statement [SharedPool-Worker-2] | 2015-01-20 14:35:47.54 | x.x.x.x | 1023
Row cache hit [SharedPool-Worker-5] | 2015-01-20 14:35:47.542000 | x.x.x.x | 2426

Row cache hit with CF table2_row_cache, 10,000 rows per partition:
Preparing statement [SharedPool-Worker-1] | 2015-01-20 16:02:51.696000 | x.x.x.x | 490
Row cache hit [SharedPool-Worker-2] | 2015-01-20 16:02:51.711000 | x.x.x.x | 15146


If in both cases the data is in memory, why isn't the cost the same? Can
someone point out what's wrong here?

Nitin Padalia
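
For reference, a hedged sketch of how per-table row caching is tuned in 2.1
(2.1 caching syntax; capping rows_per_partition is one way to bound the cost
of caching very wide partitions):

    ALTER TABLE table2_row_cache
      WITH caching = '{"keys":"ALL", "rows_per_partition":"100"}';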


Re: keyspace not exists?

2015-01-21 Thread Jason Wee
Thanks Rob, we'll keep this in mind on our learning journey.

Jason

On Wed, Jan 21, 2015 at 6:45 AM, Robert Coli rc...@eventbrite.com wrote:

 On Sun, Jan 18, 2015 at 8:55 PM, Jason Wee peich...@gmail.com wrote:

 two nodes running cassandra 2.1.2 and one running cassandra 2.1.1


 For the record, this is an unsupported persistent configuration. You are
 only supposed to have split minor versions during an upgrade.

 I have no idea if it is causing the problem you are having.

 =Rob




cassandra-stress - confusing documentation

2015-01-21 Thread Tzach Livyatan
Hi all
I'm using cassandra-stress directly from apache-cassandra-2.1.2/tools/bin
The documentation I found
http://www.datastax.com/documentation/cassandra/2.1/cassandra/tools/toolsCStress_t.html
is either too old or too advanced; either way, it does not match what I use.

In particular, I fail to use the -key populate=1..100 option as used in
the two-node example from the link above.

# On Node1
$ cassandra-stress write tries=20 n=100 cl=one -mode native cql3 -schema keyspace=Keyspace1 -key populate=1..100 -log file=~/node1_load.log -node $NODES

# On Node2
$ cassandra-stress write tries=20 n=100 cl=one -mode native cql3 -schema keyspace=Keyspace1 -key populate=101..200 -log file=~/node2_load.log -node $NODES

Can someone please direct me to the right doc, or to a valid example of
using a populate range?

Thanks
Tzach
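
A possible answer, untested: the rewritten stress tool shipped with 2.1
appears to replace -key populate=... with -pop seq=..., so the
equivalent of the example above would be something like

# On Node1
$ cassandra-stress write n=100 cl=one -mode native cql3 -schema keyspace=Keyspace1 -pop seq=1..100 -log file=~/node1_load.log -node $NODES

# On Node2
$ cassandra-stress write n=100 cl=one -mode native cql3 -schema keyspace=Keyspace1 -pop seq=101..200 -log file=~/node2_load.log -node $NODES

Run cassandra-stress help for the option names your binary actually
accepts.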


Re: Re: Dynamic Columns

2015-01-21 Thread Peter Lin
I've studied the source code and I don't believe that statement is true.
I've chatted with several long-time users of Cassandra and there are
things CQL3 doesn't support.

Like I've said before, Thrift and CQL3 complement each other. I totally
understand some committers don't want the overhead due to time and
resource limitations. On more than one occasion, people have offered to
help and work on Thrift, but were rejected. There are logs in JIRA.

For the record, it's great that CQL was created to make life easier for
new users. But here's the thing that annoys me. There are users that
just want to save and query data, but there are people out there like me
who are building tools for Cassandra. For tool builders, having an
object API like Thrift is invaluable. If we look at relational
databases, we see many of them have 2 separate APIs for that reason.
Microsoft SQL Server has SQL and an object API. Having both makes it
easier to build tools. It's a shame to ignore all the lessons RDBMSs can
teach us and suffer from NIH syndrome. I've built several data modeling
tools over the years, including ORMs.

We built our own data modeling tool for the temporal database I built on
Cassandra, so this isn't just some hypothetical complaint. This is from
many years of first-hand experience. I understand my needs often don't
and won't line up with what's in Cassandra's roadmap. But that's the
great thing about open source: should Thrift go away permanently, I'll
just fork Cassandra and do my own thing.


On Wed, Jan 21, 2015 at 8:53 AM, Sylvain Lebresne sylv...@datastax.com
wrote:

 On Wed, Jan 21, 2015 at 3:46 AM, Peter Lin wool...@gmail.com wrote:


  I don't understand why people [...] pretend it supports 100% of the use
 cases.


 Have you considered the possibility that it's actually true and you're
 just wrong for lack of knowledge?

 --
 Sylvain



Re: Count of column values

2015-01-21 Thread Poonam Ligade
Hi,

Sorry for the previous incomplete message.
I am using a where clause as follows:
select count(*) from trends where data1='abc' ALLOW FILTERING;
How can I store this count output in another column?

Can you suggest any workaround?

Thanks,
Poonam.

On Wed, Jan 21, 2015 at 7:46 PM, Poonam Ligade poonam.v.lig...@gmail.com
wrote:

 Hi,

 I am a newbie to Cassandra.
 I have to find out the top 10 recent trends in the data.

 I have schema as follows

 create table trends(
 day int,
 data1 text,
 data2 map<int, decimal>,
 PRIMARY KEY (day, data1)) ;

 I have to count the duplicate values in data1 so that I can find the
 top 10 data1 trends.

 1. I tried adding a counter column, but again you can't use an order by
 clause on a counter column.
 2. I tried using where clause
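
One workaround, sketched here with made-up names (counters cannot be
ordered or aggregated server-side, so the top-10 selection has to happen
client-side):

CREATE TABLE trend_counts (
    day int,
    data1 text,
    hits counter,
    PRIMARY KEY (day, data1)
);

-- increment once per occurrence of a data1 value:
UPDATE trend_counts SET hits = hits + 1
WHERE day = 20150121 AND data1 = 'abc';

-- read the day's counters, then sort and take the top 10 in the client:
SELECT data1, hits FROM trend_counts WHERE day = 20150121;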



Re: Re: Dynamic Columns

2015-01-21 Thread Sylvain Lebresne
On Wed, Jan 21, 2015 at 3:46 AM, Peter Lin wool...@gmail.com wrote:


  I don't understand why people [...] pretend it supports 100% of the use
 cases.


Have you considered the possibility that it's actually true and you're
just wrong for lack of knowledge?

--
Sylvain


Re: row cache hit is costlier for partition with large rows

2015-01-21 Thread Sylvain Lebresne
The row cache saves partition data off-heap, which means that every cache
hit requires copying/deserializing the cached partition into the heap, and
the more rows per partition you cache, the longer that takes. This is why
it's currently not a good idea to cache too many rows per partition (unless
you know what you're doing).
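
If large partitions must stay cached, one mitigation (a sketch assuming
the 2.1 caching syntax) is to cap how many rows per partition the row
cache keeps for a table:

ALTER TABLE table2_row_cache
    WITH caching = '{"keys":"ALL", "rows_per_partition":"100"}';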

On Wed, Jan 21, 2015 at 1:15 PM, nitin padalia padalia.ni...@gmail.com
wrote:

 Hi,

 With two different column families, when I do a read, a row cache hit
 is almost 15x costlier with larger partitions (1 rows per
 partition) in comparison to a partition with only 100 rows.

 The difference between the two column families is that one has 100 rows
 per partition and the other 1 rows per partition. The schema for the
 two tables is:
 CREATE TABLE table1_row_cache (
   user_id uuid,
   dept_id uuid,
   location_id text,
   locationmap_id uuid,
   PRIMARY KEY ((user_id, location_id), dept_id)
 )

 CREATE TABLE table2_row_cache (
   user_id uuid,
   dept_id uuid,
   location_id text,
   locationmap_id uuid,
   PRIMARY KEY ((user_id, dept_id), location_id)
 )

 Here is the tracing:

 Row cache Hit with Column Family table1_row_cache, 100 rows per partition:
  Preparing statement [SharedPool-Worker-2] | 2015-01-20
 14:35:47.54 | x.x.x.x |   1023
   Row cache hit [SharedPool-Worker-5] | 2015-01-20
 14:35:47.542000 | x.x.x.x |   2426

 Row cache Hit with CF table2_row_cache, 1 rows per partition:
 Preparing statement [SharedPool-Worker-1] | 2015-01-20 16:02:51.696000
 | x.x.x.x |490
  Row cache hit [SharedPool-Worker-2] | 2015-01-20
 16:02:51.711000 | x.x.x.x |  15146


 If the data is in memory in both cases, why is the timing not the same?
 Can someone point out what is wrong here?

 Nitin Padalia



Count of column values

2015-01-21 Thread Poonam Ligade
Hi,

I am a newbie to Cassandra.
I have to find out the top 10 recent trends in the data.

I have schema as follows

create table trends(
day int,
data1 text,
data2 map<int, decimal>,
PRIMARY KEY (day, data1)) ;

I have to count the duplicate values in data1 so that I can find the top
10 data1 trends.

1. I tried adding a counter column, but again you can't use an order by
clause on a counter column.
2. I tried using where clause


Re: Re: Dynamic Columns

2015-01-21 Thread Sylvain Lebresne
 I've chatted with several long-time users of Cassandra and there are
 things CQL3 doesn't support.


Would you care to elaborate, then? Maybe a simple example of something
(or multiple things, since you used the plural) in Thrift that cannot be
supported in CQL?
And please note that I'm *not* saying that all existing Thrift tables can
be seamlessly used from CQL: there are indeed a few cases for which that's
not the case. But that does not mean those cases cannot easily be built in
CQL from scratch.