Most stable version?

2016-04-11 Thread Jean Tremblay
Hi,
Which version of Cassandra should be considered the most stable in the version 3 line?
I see two main branches: the 3.0.* branch and the tick-tock 3.*.* one.
So basically my question is: which one is the most stable, version 3.0.5 or version 
3.3?
I know odd versions in tick-tock are bug-fix releases. 
Thanks
Jean


Re: Large primary keys

2016-04-11 Thread Jack Krupansky
Check out the text indexing capability of the new SASI feature in Cassandra
3.4. You could write a custom tokenizer to extract entities and then be
able to query for documents that contain those entities.

That said, using a SHA digest of the text as the primary key has merit for direct
access to the document given the document text.
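
For illustration only, the SASI text index mentioned above might be declared
roughly like this (table and column names are made up, and the analyzer options
should be checked against the 3.4 documentation):

CREATE TABLE documents (
    doc_digest text PRIMARY KEY,  -- e.g. a hex SHA-256 of the document text
    doc_text text,
    entities set<text>
);

CREATE CUSTOM INDEX documents_text_idx ON documents (doc_text)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = {
    'mode': 'CONTAINS',
    'analyzed': 'true',
    'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
    'tokenization_locale': 'en'
};

-- then, for example:
SELECT doc_digest FROM documents WHERE doc_text LIKE '%acme corp%';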

-- Jack Krupansky

On Mon, Apr 11, 2016 at 7:12 PM, James Carman 
wrote:

> S3 maybe?
>
> On Mon, Apr 11, 2016 at 7:05 PM Robert Wille  wrote:
>
>> I do realize it's kind of a weird use case, but it is legitimate. I have a
>> collection of documents that I need to index, and I want to perform entity
>> extraction on them and give the extracted entities special treatment in my
>> full-text index. Because entity extraction costs money, and each document
>> will end up being indexed multiple times, I want to cache them in
>> Cassandra. The document text is the obvious key to retrieve entities from
>> the cache. If I use the document ID, then I have to track timestamps. I
>> know that sounds like a simple workaround, but I’m presenting a
>> much-simplified view of my actual data model.
>>
>> The reason for needing the text in the table, and not just a digest, is
>> that sometimes entity extraction has to be deferred due to license
>> limitations. In those cases, the entity extraction occurs on a background
>> process, and the entities will be included in the index the next time the
>> document is indexed.
>>
>> I will use a digest as the key. I suspected that would be the answer, but
>> it's good to get confirmation.
>>
>> Robert
>>
>> On Apr 11, 2016, at 4:36 PM, Jan Kesten  wrote:
>>
>> > Hi Robert,
>> >
>> > why do you need the actual text as a key? It sounds a bit unnatural, at
>> least to me. Keep in mind that you cannot do "like" queries on keys in
>> Cassandra. For performance and keeping things more readable I would prefer
>> hashing your text and using the hash as the key.
>> >
>> > You should also consider storing the keys (hashes) in a
>> separate table per day / hour or something like that, so you can quickly
>> get all keys for a time range. A query without the partition key may be
>> very slow.
>> >
>> > Jan
>> >
>> > Am 11.04.2016 um 23:43 schrieb Robert Wille:
>> >> I have a need to be able to use the text of a document as the primary
>> key in a table. These texts are usually less than 1K, but can sometimes be
>> 10’s of K’s in size. Would it be better to use a digest of the text as the
>> key? I have a background process that will occasionally need to do a full
>> table scan and retrieve all of the texts, so using the digest doesn’t
>> eliminate the need to store the text. Anyway, is it better to keep primary
>> keys small, or is C* okay with large primary keys?
>> >>
>> >> Robert
>> >>
>> >
>>
>>


Re: Large primary keys

2016-04-11 Thread James Carman
S3 maybe?
On Mon, Apr 11, 2016 at 7:05 PM Robert Wille  wrote:

> I do realize it's kind of a weird use case, but it is legitimate. I have a
> collection of documents that I need to index, and I want to perform entity
> extraction on them and give the extracted entities special treatment in my
> full-text index. Because entity extraction costs money, and each document
> will end up being indexed multiple times, I want to cache them in
> Cassandra. The document text is the obvious key to retrieve entities from
> the cache. If I use the document ID, then I have to track timestamps. I
> know that sounds like a simple workaround, but I’m presenting a
> much-simplified view of my actual data model.
>
> The reason for needing the text in the table, and not just a digest, is
> that sometimes entity extraction has to be deferred due to license
> limitations. In those cases, the entity extraction occurs on a background
> process, and the entities will be included in the index the next time the
> document is indexed.
>
> I will use a digest as the key. I suspected that would be the answer, but
> it's good to get confirmation.
>
> Robert
>
> On Apr 11, 2016, at 4:36 PM, Jan Kesten  wrote:
>
> > Hi Robert,
> >
> > why do you need the actual text as a key? It sounds a bit unnatural, at
> least to me. Keep in mind that you cannot do "like" queries on keys in
> Cassandra. For performance and keeping things more readable I would prefer
> hashing your text and using the hash as the key.
> >
> > You should also consider storing the keys (hashes) in a
> separate table per day / hour or something like that, so you can quickly
> get all keys for a time range. A query without the partition key may be
> very slow.
> >
> > Jan
> >
> > Am 11.04.2016 um 23:43 schrieb Robert Wille:
> >> I have a need to be able to use the text of a document as the primary
> key in a table. These texts are usually less than 1K, but can sometimes be
> 10’s of K’s in size. Would it be better to use a digest of the text as the
> key? I have a background process that will occasionally need to do a full
> table scan and retrieve all of the texts, so using the digest doesn’t
> eliminate the need to store the text. Anyway, is it better to keep primary
> keys small, or is C* okay with large primary keys?
> >>
> >> Robert
> >>
> >
>
>


Re: Large primary keys

2016-04-11 Thread Robert Wille
I do realize it's kind of a weird use case, but it is legitimate. I have a 
collection of documents that I need to index, and I want to perform entity 
extraction on them and give the extracted entities special treatment in my 
full-text index. Because entity extraction costs money, and each document will 
end up being indexed multiple times, I want to cache them in Cassandra. The 
document text is the obvious key to retrieve entities from the cache. If I use 
the document ID, then I have to track timestamps. I know that sounds like a 
simple workaround, but I’m presenting a much-simplified view of my actual data 
model.

The reason for needing the text in the table, and not just a digest, is that 
sometimes entity extraction has to be deferred due to license limitations. In 
those cases, the entity extraction occurs on a background process, and the 
entities will be included in the index the next time the document is indexed.

I will use a digest as the key. I suspected that would be the answer, but it's 
good to get confirmation.

Robert

On Apr 11, 2016, at 4:36 PM, Jan Kesten  wrote:

> Hi Robert,
> 
> why do you need the actual text as a key? It sounds a bit unnatural, at least 
> to me. Keep in mind that you cannot do "like" queries on keys in Cassandra. 
> For performance and keeping things more readable I would prefer hashing your 
> text and using the hash as the key.
> 
> You should also consider storing the keys (hashes) in a separate 
> table per day / hour or something like that, so you can quickly get all keys 
> for a time range. A query without the partition key may be very slow.
> 
> Jan
> 
> Am 11.04.2016 um 23:43 schrieb Robert Wille:
>> I have a need to be able to use the text of a document as the primary key in 
>> a table. These texts are usually less than 1K, but can sometimes be 10’s of 
>> K’s in size. Would it be better to use a digest of the text as the key? I 
>> have a background process that will occasionally need to do a full table 
>> scan and retrieve all of the texts, so using the digest doesn’t eliminate 
>> the need to store the text. Anyway, is it better to keep primary keys small, 
>> or is C* okay with large primary keys?
>> 
>> Robert
>> 
> 



Re: Unable to connect to CQLSH or Launch SparkContext

2016-04-11 Thread Bryan Cheng
Check your environment variables; it looks like JAVA_HOME is not set properly.

On Mon, Apr 11, 2016 at 9:07 AM, Lokesh Ceeba - Vendor <
lokesh.ce...@walmart.com> wrote:

> Hi Team,
>
>   Help required
>
>
>
> cassandra:/app/cassandra $ nodetool status
>
>
>
> Cassandra 2.0 and later require Java 7u25 or later.
>
> cassandra:/app/cassandra $ nodetool status
>
>
>
> Cassandra 2.0 and later require Java 7u25 or later.
>
> cassandra:/app/cassandra $ java -version
>
> Error occurred during initialization of VM
>
> java.lang.OutOfMemoryError: unable to create new native thread
>
>
>
>
>
>
>
> --
>
> Lokesh
>


Re: Large primary keys

2016-04-11 Thread Jan Kesten

Hi Robert,

why do you need the actual text as a key? It sounds a bit unnatural, at 
least to me. Keep in mind that you cannot do "like" queries on keys in 
Cassandra. For performance and keeping things more readable I would 
prefer hashing your text and using the hash as the key.
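
As a rough sketch of that idea (table and column names are hypothetical; the
digest, e.g. a hex SHA-256 of the text, would be computed on the client side):

CREATE TABLE entity_cache (
    text_digest text PRIMARY KEY,  -- hex SHA-256 of the document text
    doc_text text,                 -- kept so deferred entity extraction can still run
    entities set<text>
);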


You should also consider storing the keys (hashes) in a 
separate table per day / hour or something like that, so you can quickly 
get all keys for a time range. A query without the partition key may be 
very slow.
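
A sketch of that second table, again with hypothetical names; one partition per
day keeps a day's worth of digests readable with a single partition query:

CREATE TABLE entity_cache_keys_by_day (
    day text,          -- e.g. '2016-04-11'; a coarser or finer bucket also works
    text_digest text,
    PRIMARY KEY (day, text_digest)
);

-- all keys cached on a given day:
SELECT text_digest FROM entity_cache_keys_by_day WHERE day = '2016-04-11';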


Jan

Am 11.04.2016 um 23:43 schrieb Robert Wille:

I have a need to be able to use the text of a document as the primary key in a 
table. These texts are usually less than 1K, but can sometimes be 10’s of K’s 
in size. Would it be better to use a digest of the text as the key? I have a 
background process that will occasionally need to do a full table scan and 
retrieve all of the texts, so using the digest doesn’t eliminate the need to 
store the text. Anyway, is it better to keep primary keys small, or is C* okay 
with large primary keys?

Robert





Re: Migrating to CQL and Non Compact Storage

2016-04-11 Thread Jim Ancona
On Mon, Apr 11, 2016 at 4:19 PM, Jack Krupansky 
wrote:

> Some of this may depend on exactly how you are using so-called COMPACT
> STORAGE. I mean, if your tables really are modeled with all but exactly one
> column in the primary key, then okay, COMPACT STORAGE may be a reasonable
> model, but that seems to be a very special, narrow use case, so for all
> other cases you really do need to re-model for CQL for Cassandra 4.0.
>
There was no such restriction when modeling with Thrift. It's an artifact
of how CQL chose to expose the Thrift data model.

> I'm not sure why anybody is thinking otherwise. Sure, maybe it will be a lot
> of work, but that's life and people have been given plenty of notice.
>
"That's life" minimizes the difficulty of doing this sort of migration for
large, mission-critical systems. It would require large amounts of time, as
> well as temporarily doubling hardware resources, amounting to dozens or even
hundreds of nodes.

> And if it takes hours to do a data migration, I think that you can consider
> yourself lucky relative to people who may require days.
>
Or more.

> Now, if there are particular Thrift use cases that don't have efficient
> models in CQL, that can be discussed. Start by expressing the Thrift data
> in a neutral, natural, logical, plain English data model, and then we can
> see how that maps to CQL.
>
> So, where are we? Is it just the complaint that migration is slow and
> re-modeling is difficult, or are there specific questions about how to do
> the re-modeling?
>
My purpose is not to complain, but to educate :-). Telling someone "just
remodel your data" is not helpful, especially after he's told you that he
tried that and ran into performance issues. (Note that the link he posted
shows an order of magnitude decrease in throughput when moving from COMPACT
> STORAGE to CQL3 native tables for analytics workloads, so it's not just his
use case.) Do you have any suggestions of ways he might mitigate those
issues? Is there information you need to make such a recommendation?

Jim


>
>
> -- Jack Krupansky
>
> On Mon, Apr 11, 2016 at 1:30 PM, Anuj Wadehra 
> wrote:
>
>> Thanks Jim. I think you understand the pain of migrating TBs of data to
>> new tables. There is no command to change from compact to non compact
>> storage and the fastest solution to migrate data using Spark is too slow
>> for production systems.
>>
>> And the pain gets bigger when your performance dips after moving to a non
>> compact storage table. That's because non compact storage is quite an
>> inefficient storage format until 3.x, and it incurs a heavy penalty on Row
>> Scan performance in Analytics workloads.
>> Please go through the link to understand how old Compact storage gives
>> much better performance than non compact storage as far as Row Scans are
>> concerned:
>> https://www.oreilly.com/ideas/apache-cassandra-for-analytics-a-performance-and-storage-analysis
>>
>> The flexibility of CQL comes at a heavy cost until 3.x.
>>
>>
>>
>> Thanks
>> Anuj
>> Sent from Yahoo Mail on Android
>> 
>>
>> On Mon, 11 Apr, 2016 at 10:35 PM, Jim Ancona
>>  wrote:
>> Jack, the Datastax link he posted (
>> http://www.datastax.com/dev/blog/thrift-to-cql3) says that for column
>> families with mixed dynamic and static columns: "The only solution to be
>> able to access the column family fully is to remove the declared columns
>> from the thrift schema altogether..." I think that page describes the
>> problem and the potential solutions well. I haven't seen an answer to
>> Anuj's question about why the native CQL solution using collections doesn't
>> perform as well.
>>
>> Keep in mind that some of us understand CQL just fine but have working
>> pre-CQL Thrift-based systems storing hundreds of terabytes of data and with
>> requirements that mean that saying "bite the bullet and re-model your
>> data" is not really helpful. Another quote from that Datastax link:
>> "Thrift isn't going anywhere." Granted that that link is three-plus years
>> old, but Thrift *is* now going away, so it's not unexpected that people
>> will be trying to figure out how to deal with that. It's bad enough that we
>> need to rewrite our clients to use CQL instead of Thrift. It's not helpful
>> to say that we should also re-model and migrate all our data.
>>
>> Jim
>>
>> On Mon, Apr 11, 2016 at 11:29 AM, Jack Krupansky <
>> jack.krupan...@gmail.com> wrote:
>>
>>> Sorry, but your message is too confusing - you say "reading dynamic
>>> columns in CQL" and "make the table schema less", but neither has any
>>> relevance to CQL! 1. CQL tables always have schemas. 2. All columns in CQL
>>> are statically declared (even maps/collections are statically declared
>>> columns.) Granted, it is a challenge for Thrift users to get used to the
>>> terminology of CQL, but it is required. If necessary, review some of the
>>> free online training videos for 

Re: Large primary keys

2016-04-11 Thread James Carman
Why does the text need to be the key?

On Mon, Apr 11, 2016 at 6:04 PM Robert Wille  wrote:

> I have a need to be able to use the text of a document as the primary key
> in a table. These texts are usually less than 1K, but can sometimes be 10’s
> of K’s in size. Would it be better to use a digest of the text as the key?
> I have a background process that will occasionally need to do a full table
> scan and retrieve all of the texts, so using the digest doesn’t eliminate
> the need to store the text. Anyway, is it better to keep primary keys
> small, or is C* okay with large primary keys?
>
> Robert
>
>


Re: Large primary keys

2016-04-11 Thread Bryan Cheng
While large primary keys (within reason) should work, IMO anytime you're
doing equality testing you are really better off minimizing the size of the
key. Huge primary keys will also have very negative impacts on your key
cache. I would err on the side of the digest, but I've never had a need for
large keys so perhaps someone who has used them before would have a
different perspective.

On Mon, Apr 11, 2016 at 2:43 PM, Robert Wille  wrote:

> I have a need to be able to use the text of a document as the primary key
> in a table. These texts are usually less than 1K, but can sometimes be 10’s
> of K’s in size. Would it be better to use a digest of the text as the key?
> I have a background process that will occasionally need to do a full table
> scan and retrieve all of the texts, so using the digest doesn’t eliminate
> the need to store the text. Anyway, is it better to keep primary keys
> small, or is C* okay with large primary keys?
>
> Robert
>
>


Large primary keys

2016-04-11 Thread Robert Wille
I have a need to be able to use the text of a document as the primary key in a 
table. These texts are usually less than 1K, but can sometimes be 10’s of K’s 
in size. Would it be better to use a digest of the text as the key? I have a 
background process that will occasionally need to do a full table scan and 
retrieve all of the texts, so using the digest doesn’t eliminate the need to 
store the text. Anyway, is it better to keep primary keys small, or is C* okay 
with large primary keys?

Robert



Restricting secondary indexes

2016-04-11 Thread Thanigai Vellore
Hello,

In a multi-DC setup (where one DC serves real-time traffic and the other DC 
serves up analytical loads), is it possible to set up and restrict secondary 
indexes only to the analytics DC? The intent is to not create the overhead of 
the secondary index on the DC where real-time traffic is served. Are there any 
other recommendations to achieve this?

-Thanigai


Re: DataStax OpsCenter with Apache Cassandra

2016-04-11 Thread James Carman
Since when did this become a DataStax support email list?  If folks have
questions about DataStax products, shouldn't they be contacting the company
directly?


On Sun, Apr 10, 2016 at 1:13 PM Jeff Jirsa 
wrote:

> It is possible to use OpsCenter for open source / community versions up to
> 2.2.x. It will not be possible in 3.0+
>
>
>
> From: Anuj Wadehra
> Reply-To: "user@cassandra.apache.org"
> Date: Sunday, April 10, 2016 at 9:28 AM
> To: User
> Subject: DataStax OpsCenter with Apache Cassandra
>
> Hi,
>
> Is it possible to use DataStax OpsCenter for monitoring Apache distributed
> Cassandra in Production?
>
> OR
>
>  Is it possible to use DataStax OpsCenter if you are not using DataStax
> Enterprise in production?
>
>
> Thanks
> Anuj
>


Re: Migrating to CQL and Non Compact Storage

2016-04-11 Thread Anuj Wadehra
Thanks Jim. I think you understand the pain of migrating TBs of data to new 
tables. There is no command to change from compact to non compact storage and 
the fastest solution to migrate data using Spark is too slow for production 
systems.
And the pain gets bigger when your performance dips after moving to a non compact 
storage table. That's because non compact storage is quite an inefficient storage 
format until 3.x, and it incurs a heavy penalty on Row Scan performance in 
Analytics workloads. Please go through the link to understand how old Compact 
storage gives much better performance than non compact storage as far as Row 
Scans are concerned: 
https://www.oreilly.com/ideas/apache-cassandra-for-analytics-a-performance-and-storage-analysis
The flexibility of CQL comes at a heavy cost until 3.x.


Thanks
Anuj

Sent from Yahoo Mail on Android

On Mon, 11 Apr, 2016 at 10:35 PM, Jim Ancona wrote:
Jack, the Datastax link he posted 
(http://www.datastax.com/dev/blog/thrift-to-cql3) says that for column families 
with mixed dynamic and static columns: "The only solution to be able to access 
the column family fully is to remove the declared columns from the thrift 
schema altogether..." I think that page describes the problem and the potential 
solutions well. I haven't seen an answer to Anuj's question about why the 
native CQL solution using collections doesn't perform as well.
Keep in mind that some of us understand CQL just fine but have working pre-CQL 
Thrift-based systems storing hundreds of terabytes of data and with 
requirements that mean that saying "bite the bullet and re-model your data" is 
not really helpful. Another quote from that Datastax link: "Thrift isn't going 
anywhere." Granted that that link is three-plus years old, but Thrift now *is* 
now going away, so it's not unexpected that people will be trying to figure out 
how to deal with that. It's bad enough that we need to rewrite our clients to 
use CQL instead of Thrift. It's not helpful to say that we should also re-model 
and migrate all our data.
Jim
On Mon, Apr 11, 2016 at 11:29 AM, Jack Krupansky  
wrote:

Sorry, but your message is too confusing - you say "reading dynamic columns in 
CQL" and "make the table schema less", but neither has any relevance to CQL! 1. 
CQL tables always have schemas. 2. All columns in CQL are statically declared 
(even maps/collections are statically declared columns.) Granted, it is a 
challenge for Thrift users to get used to the terminology of CQL, but it is 
required. If necessary, review some of the free online training videos for data 
modeling.
Unless your data model is very simple and directly translates into CQL, you 
probably do need to bite the bullet and re-model your data to exploit the 
features of CQL rather than fight CQL trying to mimic Thrift per se.
In any case, take another shot at framing the problem and then maybe people 
here can help you out.
-- Jack Krupansky
On Mon, Apr 11, 2016 at 10:39 AM, Anuj Wadehra  wrote:

Any comments or suggestions on this one? 

Thanks
Anuj

Sent from Yahoo Mail on Android

On Sun, 10 Apr, 2016 at 11:39 PM, Anuj Wadehra wrote:

Hi
We are on 2.0.14 and Thrift. We are planning to migrate to CQL soon but facing 
some challenges.
We have a cf with a mix of statically defined columns and dynamic columns 
(created at run time). For reading dynamic columns in CQL, we have two options:
1. Drop all columns and make the table schema-less. This way, we will get a CQL 
row for each column defined for a row key, as mentioned here: 
http://www.datastax.com/dev/blog/thrift-to-cql3
2. Migrate the entire data to a new non compact storage table and create collections 
for dynamic columns in the new table.
In our case, we have observed that approach 2 causes 3 times slower performance 
in Range scan queries used by Spark. This is not acceptable. Cassandra 3 has an 
optimized storage engine, but we are not comfortable moving to 3.x in production.
Moreover, data migration to new table using Spark takes hours. 

Any suggestions for the two issues?

Thanks
Anuj

Sent from Yahoo Mail on Android  




  


Re: Migrating to CQL and Non Compact Storage

2016-04-11 Thread Jim Ancona
Jack, the Datastax link he posted (
http://www.datastax.com/dev/blog/thrift-to-cql3) says that for column
families with mixed dynamic and static columns: "The only solution to be
able to access the column family fully is to remove the declared columns
from the thrift schema altogether..." I think that page describes the
problem and the potential solutions well. I haven't seen an answer to
Anuj's question about why the native CQL solution using collections doesn't
perform as well.

Keep in mind that some of us understand CQL just fine but have working
pre-CQL Thrift-based systems storing hundreds of terabytes of data and with
requirements that mean that saying "bite the bullet and re-model your data"
is not really helpful. Another quote from that Datastax link: "Thrift isn't
going anywhere." Granted that that link is three-plus years old, but Thrift
*is* now going away, so it's not unexpected that people will be trying
to figure out how to deal with that. It's bad enough that we need to
rewrite our clients to use CQL instead of Thrift. It's not helpful to say
that we should also re-model and migrate all our data.

Jim

On Mon, Apr 11, 2016 at 11:29 AM, Jack Krupansky 
wrote:

> Sorry, but your message is too confusing - you say "reading dynamic
> columns in CQL" and "make the table schema less", but neither has any
> relevance to CQL! 1. CQL tables always have schemas. 2. All columns in CQL
> are statically declared (even maps/collections are statically declared
> columns.) Granted, it is a challenge for Thrift users to get used to the
> terminology of CQL, but it is required. If necessary, review some of the
> free online training videos for data modeling.
>
> Unless your data model is very simple and directly translates into
> CQL, you probably do need to bite the bullet and re-model your data to
> exploit the features of CQL rather than fight CQL trying to mimic Thrift
> per se.
>
> In any case, take another shot at framing the problem and then maybe
> people here can help you out.
>
> -- Jack Krupansky
>
> On Mon, Apr 11, 2016 at 10:39 AM, Anuj Wadehra 
> wrote:
>
>> Any comments or suggestions on this one?
>>
>> Thanks
>> Anuj
>>
>> Sent from Yahoo Mail on Android
>> 
>>
>> On Sun, 10 Apr, 2016 at 11:39 PM, Anuj Wadehra
>>  wrote:
>> Hi
>>
>> We are on 2.0.14 and Thrift. We are planning to migrate to CQL soon but
>> facing some challenges.
>>
>> We have a cf with a mix of statically defined columns and dynamic columns
>> (created at run time). For reading dynamic columns in CQL,
>> we have two options:
>>
>> 1. Drop all columns and make the table schema-less. This way, we will get
>> a CQL row for each column defined for a row key, as mentioned here:
>> http://www.datastax.com/dev/blog/thrift-to-cql3
>>
>> 2. Migrate the entire data to a new non compact storage table and create
>> collections for dynamic columns in the new table.
>>
>> In our case, we have observed that approach 2 causes 3 times slower
>> performance in Range scan queries used by Spark. This is not acceptable.
>> Cassandra 3 has an optimized storage engine, but we are not comfortable moving
>> to 3.x in production.
>>
>> Moreover, data migration to new table using Spark takes hours.
>>
>> Any suggestions for the two issues?
>>
>>
>> Thanks
>> Anuj
>>
>>
>> Sent from Yahoo Mail on Android
>> 
>>
>>
>


unsubscribe

2016-04-11 Thread Gvb Subrahmanyam








Re: 1, 2, 3...

2016-04-11 Thread Emīls Šolmanis
You're not mistaken, just thought you were after partition keys and didn't
read the question that carefully. Afaik, you're SOOL if you need to
distinguish clustering keys as unique. Well, other than doing a full table
scan of course, which I'm assuming is not too plausible.

On Mon, 11 Apr 2016 at 16:52 Jack Krupansky 
wrote:

> Unless I'm mistaken, nodetool tablestats gives you the number of
> partitions (partition keys), not the number of primary keys. IOW, the term
> "keys" is ambiguous. That's why I phrased the original question as count of
> (CQL) rows, to distinguish from the pre-CQL3 concept of a partition being
> treated as a single row.
>
> -- Jack Krupansky
>
> On Mon, Apr 11, 2016 at 11:46 AM, Emīls Šolmanis  > wrote:
>
>> Wouldn't the "number of keys" part of *nodetool cfstats* run on every
>> node, summed and divided by replication factor give you a decent
>> approximation? Or are you really after a completely precise number?
>>
>> On Mon, 11 Apr 2016 at 16:18 Jack Krupansky 
>> wrote:
>>
>>> Agreed, that anything requiring a full table scan, short of batch
>>> analytics, is an antipattern, although the goal is not to do a full scan per
>>> se, but just get the row count. It still surprises people that Cassandra
>>> cannot quickly get COUNT(*). The easy answer: Use DSE Search and do a Solr
>>> query for q=*:* and that will very quickly return the total row count. I
>>> presume that Stratio will handle this fine as well.
>>>
>>>
>>> -- Jack Krupansky
>>>
>>> On Mon, Apr 11, 2016 at 11:10 AM,  wrote:
>>>
 Cassandra is not good for table scan type queries (which count(*)
 typically is). While there are some attempts to do that (as noted below),
 this is a path I avoid.





 Sean Durity



 *From:* Max C [mailto:mc_cassan...@core43.com]
 *Sent:* Saturday, April 09, 2016 6:19 PM
 *To:* user@cassandra.apache.org
 *Subject:* Re: 1, 2, 3...



 Looks like this guy (Brian Hess) wrote a script to split the token
 range and run count(*) on each subrange:



 https://github.com/brianmhess/cassandra-count



 - Max



 On Apr 8, 2016, at 10:56 pm, Jeff Jirsa 
 wrote:



 SELECT COUNT(*) probably works (with internal paging) on many datasets
 with enough time and assuming you don’t have any partitions that will kill
 you.



 No, it doesn’t count extra replicas / duplicates.



 The old way to do this (before paging / fetch size) was to use manual
 paging based on tokens/clustering keys:



 https://docs.datastax.com/en/cql/3.1/cql/cql_using/paging_c.html –
 SELECT’s WHERE clause can use token(), which is what you’d want to use to
 page through the whole token space.



 You could, in theory, issue thousands of queries in parallel, all for
 different token ranges, and then sum the results. That’s what something
 like spark would be doing. If you want to determine rows per node, limit
 the token range to that owned by the node (easier with 1 token than vnodes,
 with vnodes repeat num_tokens times).



 --


>>>
>>>
>


Re: 1, 2, 3...

2016-04-11 Thread Jack Krupansky
Unless I'm mistaken, nodetool tablestats gives you the number of partitions
(partition keys), not the number of primary keys. IOW, the term "keys" is
ambiguous. That's why I phrased the original question as count of (CQL)
rows, to distinguish from the pre-CQL3 concept of a partition being treated
as a single row.

-- Jack Krupansky

On Mon, Apr 11, 2016 at 11:46 AM, Emīls Šolmanis 
wrote:

> Wouldn't the "number of keys" part of *nodetool cfstats* run on every
> node, summed and divided by replication factor give you a decent
> approximation? Or are you really after a completely precise number?
>
> On Mon, 11 Apr 2016 at 16:18 Jack Krupansky 
> wrote:
>
>> Agreed, that anything requiring a full table scan, short of batch
>> analytics, is an antipattern, although the goal is not to do a full scan per
>> se, but just get the row count. It still surprises people that Cassandra
>> cannot quickly get COUNT(*). The easy answer: Use DSE Search and do a Solr
>> query for q=*:* and that will very quickly return the total row count. I
>> presume that Stratio will handle this fine as well.
>>
>>
>> -- Jack Krupansky
>>
>> On Mon, Apr 11, 2016 at 11:10 AM,  wrote:
>>
>>> Cassandra is not good for table scan type queries (which count(*)
>>> typically is). While there are some attempts to do that (as noted below),
>>> this is a path I avoid.
>>>
>>>
>>>
>>>
>>>
>>> Sean Durity
>>>
>>>
>>>
>>> *From:* Max C [mailto:mc_cassan...@core43.com]
>>> *Sent:* Saturday, April 09, 2016 6:19 PM
>>> *To:* user@cassandra.apache.org
>>> *Subject:* Re: 1, 2, 3...
>>>
>>>
>>>
>>> Looks like this guy (Brian Hess) wrote a script to split the token range
>>> and run count(*) on each subrange:
>>>
>>>
>>>
>>> https://github.com/brianmhess/cassandra-count
>>>
>>>
>>>
>>> - Max
>>>
>>>
>>>
>>> On Apr 8, 2016, at 10:56 pm, Jeff Jirsa 
>>> wrote:
>>>
>>>
>>>
>>> SELECT COUNT(*) probably works (with internal paging) on many datasets
>>> with enough time and assuming you don’t have any partitions that will kill
>>> you.
>>>
>>>
>>>
>>> No, it doesn’t count extra replicas / duplicates.
>>>
>>>
>>>
>>> The old way to do this (before paging / fetch size) was to use manual
>>> paging based on tokens/clustering keys:
>>>
>>>
>>>
>>> https://docs.datastax.com/en/cql/3.1/cql/cql_using/paging_c.html –
>>> SELECT’s WHERE clause can use token(), which is what you’d want to use to
>>> page through the whole token space.
>>>
>>>
>>>
>>> You could, in theory, issue thousands of queries in parallel, all for
>>> different token ranges, and then sum the results. That’s what something
>>> like spark would be doing. If you want to determine rows per node, limit
>>> the token range to that owned by the node (easier with 1 token than vnodes,
>>> with vnodes repeat num_tokens times).
>>>
>>>
>>>
>>> --
>>>
>>>
>>
>>


Re: 1, 2, 3...

2016-04-11 Thread Emīls Šolmanis
Wouldn't the "number of keys" part of *nodetool cfstats* run on every node,
summed and divided by replication factor give you a decent approximation?
Or are you really after a completely precise number?

On Mon, 11 Apr 2016 at 16:18 Jack Krupansky 
wrote:

> Agreed, that anything requiring a full table scan, short of batch
> analytics, is an antipattern, although the goal is not to do a full scan per
> se, but just get the row count. It still surprises people that Cassandra
> cannot quickly get COUNT(*). The easy answer: Use DSE Search and do a Solr
> query for q=*:* and that will very quickly return the total row count. I
> presume that Stratio will handle this fine as well.
>
>
> -- Jack Krupansky
>
> On Mon, Apr 11, 2016 at 11:10 AM,  wrote:
>
>> Cassandra is not good for table scan type queries (which count(*)
>> typically is). While there are some attempts to do that (as noted below),
>> this is a path I avoid.
>>
>>
>>
>>
>>
>> Sean Durity
>>
>>
>>
>> *From:* Max C [mailto:mc_cassan...@core43.com]
>> *Sent:* Saturday, April 09, 2016 6:19 PM
>> *To:* user@cassandra.apache.org
>> *Subject:* Re: 1, 2, 3...
>>
>>
>>
>> Looks like this guy (Brian Hess) wrote a script to split the token range
>> and run count(*) on each subrange:
>>
>>
>>
>> https://github.com/brianmhess/cassandra-count
>>
>>
>>
>> - Max
>>
>>
>>
>> On Apr 8, 2016, at 10:56 pm, Jeff Jirsa 
>> wrote:
>>
>>
>>
>> SELECT COUNT(*) probably works (with internal paging) on many datasets
>> with enough time and assuming you don’t have any partitions that will kill
>> you.
>>
>>
>>
>> No, it doesn’t count extra replicas / duplicates.
>>
>>
>>
>> The old way to do this (before paging / fetch size) was to use manual
>> paging based on tokens/clustering keys:
>>
>>
>>
>> https://docs.datastax.com/en/cql/3.1/cql/cql_using/paging_c.html –
>> SELECT’s WHERE clause can use token(), which is what you’d want to use to
>> page through the whole token space.
>>
>>
>>
>> You could, in theory, issue thousands of queries in parallel, all for
>> different token ranges, and then sum the results. That’s what something
>> like spark would be doing. If you want to determine rows per node, limit
>> the token range to that owned by the node (easier with 1 token than vnodes,
>> with vnodes repeat num_tokens times).
>>
>>
>>
>> --
>>
>>
>
>


unsubscribe

2016-04-11 Thread Scott Thompson


Scott Thompson




  


Re: Migrating to CQL and Non Compact Storage

2016-04-11 Thread Jack Krupansky
Sorry, but your message is too confusing - you say "reading dynamic columns
in CQL" and "make the table schema less", but neither has any relevance to
CQL! 1. CQL tables always have schemas. 2. All columns in CQL are
statically declared (even maps/collections are statically declared
columns.) Granted, it is a challenge for Thrift users to get used to the
terminology of CQL, but it is required. If necessary, review some of the
free online training videos for data modeling.
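
For illustration, the usual CQL shape for Thrift-style dynamic columns is a
statically declared clustering column, so each dynamic column becomes one CQL
row (a sketch with made-up names, not a schema for the case discussed below):

CREATE TABLE data_by_row_key (
    row_key text,
    column_name text,
    value text,
    PRIMARY KEY (row_key, column_name)
);

-- all "dynamic columns" of one former Thrift row:
SELECT column_name, value FROM data_by_row_key WHERE row_key = 'some-key';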

Unless your data model is very simple and directly translates into CQL,
you probably do need to bite the bullet and re-model your data to exploit
the features of CQL rather than fight CQL trying to mimic Thrift per se.

In any case, take another shot at framing the problem and then maybe people
here can help you out.

-- Jack Krupansky

On Mon, Apr 11, 2016 at 10:39 AM, Anuj Wadehra 
wrote:

> Any comments or suggestions on this one?
>
> Thanks
> Anuj
>
> Sent from Yahoo Mail on Android
> 
>
> On Sun, 10 Apr, 2016 at 11:39 PM, Anuj Wadehra
>  wrote:
> Hi
>
> We are on 2.0.14 and Thrift. We are planning to migrate to CQL soon but
> facing some challenges.
>
> We have a cf with a mix of statically defined columns and dynamic columns
> (created at run time). For reading dynamic columns in CQL,
> we have two options:
>
> 1. Drop all columns and make the table schema-less. This way, we will get
> a CQL row for each column defined for a row key, as mentioned here:
> http://www.datastax.com/dev/blog/thrift-to-cql3
>
> 2. Migrate the entire data to a new non compact storage table and create
> collections for dynamic columns in the new table.
>
> In our case, we have observed that approach 2 causes 3 times slower
> performance in Range scan queries used by Spark. This is not acceptable.
> Cassandra 3 has an optimized storage engine, but we are not comfortable moving
> to 3.x in production.
>
> Moreover, data migration to new table using Spark takes hours.
>
> Any suggestions for the two issues?
>
>
> Thanks
> Anuj
>
>
> Sent from Yahoo Mail on Android
> 
>
>


Re: 1, 2, 3...

2016-04-11 Thread Jack Krupansky
Agreed, that anything requiring a full table scan, short of batch
analytics, is an antipattern, although the goal is not to do a full scan per
se, but just get the row count. It still surprises people that Cassandra
cannot quickly get COUNT(*). The easy answer: Use DSE Search and do a Solr
query for q=*:* and that will very quickly return the total row count. I
presume that Stratio will handle this fine as well.


-- Jack Krupansky

On Mon, Apr 11, 2016 at 11:10 AM,  wrote:

> Cassandra is not good for table scan type queries (which count(*)
> typically is). While there are some attempts to do that (as noted below),
> this is a path I avoid.
>
>
>
>
>
> Sean Durity
>
>
>
> *From:* Max C [mailto:mc_cassan...@core43.com]
> *Sent:* Saturday, April 09, 2016 6:19 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: 1, 2, 3...
>
>
>
> Looks like this guy (Brian Hess) wrote a script to split the token range
> and run count(*) on each subrange:
>
>
>
> https://github.com/brianmhess/cassandra-count
>
>
>
> - Max
>
>
>
> On Apr 8, 2016, at 10:56 pm, Jeff Jirsa 
> wrote:
>
>
>
> SELECT COUNT(*) probably works (with internal paging) on many datasets
> with enough time and assuming you don’t have any partitions that will kill
> you.
>
>
>
> No, it doesn’t count extra replicas / duplicates.
>
>
>
> The old way to do this (before paging / fetch size) was to use manual
> paging based on tokens/clustering keys:
>
>
>
> https://docs.datastax.com/en/cql/3.1/cql/cql_using/paging_c.html –
> SELECT’s WHERE clause can use token(), which is what you’d want to use to
> page through the whole token space.
>
>
>
> You could, in theory, issue thousands of queries in parallel, all for
> different token ranges, and then sum the results. That’s what something
> like spark would be doing. If you want to determine rows per node, limit
> the token range to that owned by the node (easier with 1 token than vnodes,
> with vnodes repeat num_tokens times).
>
>
>
> --
>
>


RE: 1, 2, 3...

2016-04-11 Thread SEAN_R_DURITY
Cassandra is not good for table scan type queries (which count(*) typically 
is). While there are some attempts to do that (as noted below), this is a path 
I avoid.


Sean Durity

From: Max C [mailto:mc_cassan...@core43.com]
Sent: Saturday, April 09, 2016 6:19 PM
To: user@cassandra.apache.org
Subject: Re: 1, 2, 3...

Looks like this guy (Brian Hess) wrote a script to split the token range and 
run count(*) on each subrange:

https://github.com/brianmhess/cassandra-count

- Max

On Apr 8, 2016, at 10:56 pm, Jeff Jirsa 
> wrote:

SELECT COUNT(*) probably works (with internal paging) on many datasets with 
enough time and assuming you don’t have any partitions that will kill you.

No, it doesn’t count extra replicas / duplicates.

The old way to do this (before paging / fetch size) was to use manual paging 
based on tokens/clustering keys:

https://docs.datastax.com/en/cql/3.1/cql/cql_using/paging_c.html – SELECT’s 
WHERE clause can use token(), which is what you’d want to use to page through 
the whole token space.

You could, in theory, issue thousands of queries in parallel, all for different 
token ranges, and then sum the results. That’s what something like spark would 
be doing. If you want to determine rows per node, limit the token range to that 
owned by the node (easier with 1 token than vnodes, with vnodes repeat 
num_tokens times).
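
For illustration, each of those per-range queries is just a token-bounded count;
the keyspace, table, partition key and literal range bounds below are placeholders:

SELECT COUNT(*) FROM ks.tbl
WHERE token(pk) > -9223372036854775808 AND token(pk) <= -4611686018427387904;

-- repeat with consecutive ranges that cover the whole ring, then sum the counts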






Re: Migrating to CQL and Non Compact Storage

2016-04-11 Thread Anuj Wadehra
Any comments or suggestions on this one? 

Thanks
Anuj

Sent from Yahoo Mail on Android

On Sun, 10 Apr, 2016 at 11:39 PM, Anuj Wadehra wrote:

Hi
We are on 2.0.14 and Thrift. We are planning to migrate to CQL soon but facing 
some challenges.
We have a cf with a mix of statically defined columns and dynamic columns 
(created at run time). For reading dynamic columns in CQL, we have two options:
1. Drop all columns and make the table schema-less. This way, we will get a CQL 
row for each column defined for a row key, as mentioned here: 
http://www.datastax.com/dev/blog/thrift-to-cql3
2. Migrate the entire data to a new non compact storage table and create collections 
for dynamic columns in the new table.
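
For illustration, option 2 usually ends up looking something like the sketch
below (hypothetical names; the statically defined columns stay as regular
columns and the dynamic ones move into a collection):

CREATE TABLE cf_migrated (
    row_key text PRIMARY KEY,
    static_col_a text,
    static_col_b int,
    dynamic_cols map<text, text>
);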
In our case, we have observed that approach 2 causes 3 times slower performance 
in Range scan queries used by Spark. This is not acceptable. Cassandra 3 has an 
optimized storage engine, but we are not comfortable moving to 3.x in production.
Moreover, data migration to new table using Spark takes hours. 

Any suggestions for the two issues?

Thanks
Anuj

Sent from Yahoo Mail on Android  


[RELEASE] Apache Cassandra 3.0.5 released

2016-04-11 Thread Jake Luciani
The Cassandra team is pleased to announce the release of Apache Cassandra
version 3.0.5.

Apache Cassandra is a fully distributed database. It is the right choice
when you need scalability and high availability without compromising
performance.

 http://cassandra.apache.org/

Downloads of source and binary distributions are listed in our download
section:

 http://cassandra.apache.org/download/

This version is a bug fix release[1] on the 3.0 series. As always, please pay
attention to the release notes[2] and let us know[3] if you were to encounter
any problem.

Enjoy!

[1]: http://goo.gl/tlNv8g (CHANGES.txt)
[2]: http://goo.gl/WrCSKw (NEWS.txt)
[3]: https://issues.apache.org/jira/browse/CASSANDRA


Re: Latency overhead on Cassandra cluster deployed on multiple AZs (AWS)

2016-04-11 Thread Chris Lohfink
Where do you get the ~1ms latency between AZs? Comparing a short term
average to a 99th percentile isn't very fair.

"Over the last month, the median is 2.09 ms, 90th percentile is 20ms,
99th percentile
is 47ms." - per
https://www.quora.com/What-are-typical-ping-times-between-different-EC2-availability-zones-within-the-same-region

Are you using EBS? That would further impact latency on reads and GCs will
always cause hiccups in the 99th+.

Chris


On Mon, Apr 11, 2016 at 7:57 AM, Alessandro Pieri  wrote:

> Hi everyone,
>
> Last week I ran some tests to estimate the latency overhead introduced in
> a Cassandra cluster by a multi-availability-zone setup on AWS EC2.
>
> I started a Cassandra cluster of 6 nodes deployed on 3 different AZs (2
> nodes/AZ).
>
> Then, I used cassandra-stress to create an INSERT (write) test of 20M
> entries with a replication factor = 3; right after, I ran cassandra-stress
> again to READ 10M entries.
>
> Well, I got the following unexpected result:
>
> Single-AZ, CL=ONE -> median/95th percentile/99th percentile:
> 1.06ms/7.41ms/55.81ms
> Multi-AZ, CL=ONE -> median/95th percentile/99th percentile:
> 1.16ms/38.14ms/47.75ms
>
> Basically, switching to the multi-AZ setup, the latency increased by ~30ms.
> That's too much considering that the average network latency between AZs on
> AWS is ~1ms.
>
> Since I couldn't find anything to explain those results, I decided to run
> the cassandra-stress specifying only a single node entry (i.e. "--nodes
> node1" instead of "--nodes node1,node2,node3,node4,node5,node6") and
> surprisingly the latency went back to 5.9 ms.
>
> Trying to recap:
>
> Multi-AZ, CL=ONE, "--nodes node1,node2,node3,node4,node5,node6" -> 95th
> percentile: 38.14ms
> Multi-AZ, CL=ONE, "--nodes node1" -> 95th percentile: 5.9ms
>
> For the sake of completeness, I ran a further test using a consistency
> level = LOCAL_QUORUM, and the test did not show any large variance between
> using a single node and multiple ones.
>
> Do you guys know what could be the reason?
>
> The tests were executed on an m3.xlarge (network optimized) using the
> DataStax AMI 2.6.3 running Cassandra v2.0.15.
>
> Thank you in advance for your help.
>
> Cheers,
> Alessandro
>


unsubscribe

2016-04-11 Thread Vitaly Sourikov
unsubscribe


RE: all the nodes are not reachable when running massive deletes

2016-04-11 Thread Paco Trujillo
Thanks Alain for all your answers:


-  In a few days I am going to set up a maintenance window so I can 
test running repairs again and see what happens. I will definitely run 'iostat 
-mx 5 100' at that time and also use the command you pointed to, to see what is 
consuming so much CPU power.

-  About the client configuration, we had QUORUM because we were 
planning to have another data center last year (running in the locations of one 
of our clients), but in the end we postponed that. The configuration is still 
the same :), thanks for the indication. We used the downgrading policy because 
of the timeouts and problems we had in the past with the network. In fact, I 
have not seen the downgrading occurring in the logs for some months, so it is 
probably good to also remove it from the configuration.

-  The secondary index on the cf is definitely a bad decision, taken 
at the beginning when I was getting familiar with Cassandra. The problem is 
that the cf has a lot of data at this moment, and remodelling it will cost some 
time, so we decided to postpone it. There are some queries which use this index; 
using materialized views on this cf and others related to it will solve the 
problem. But for that, I need to upgrade the cluster ☺

-  Good that you mention that LCS will not be a good idea, because I 
was planning to make a snapshot of that cf and restore the data in our test 
cluster to see if the LCS compaction would help. It was more a decision based on 
“I have to try something” than based on arguments ☺



From: Alain RODRIGUEZ [mailto:arodr...@gmail.com]
Sent: vrijdag 8 april 2016 12:46
To: user@cassandra.apache.org
Subject: Re: all the nodes are not reachable when running massive deletes

It looks like a complex issue; it might be worth having a close look at your data 
model, configurations and machines.

It is hard to help you from the mailing list. Yet here are some thoughts, some 
might be irrelevant or wrong, but some others might point you to your issue, 
hope we will get lucky there :-):

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1,00    0,00    0,40    0,03    0,00   98,57

Device:  rrqm/s  wrqm/s   r/s   w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda        0,00    0,00  0,00  0,20   0,00   0,00      8,00      0,00   0,00   0,00   0,00
sdb        0,00    0,00  0,00  0,00   0,00   0,00      0,00      0,00   0,00   0,00   0,00
sdc        0,00    0,00  0,00  0,00   0,00   0,00      0,00      0,00   0,00   0,00   0,00
sdd        0,00    0,20  0,00  0,40   0,00   0,00     12,00      0,00   2,50   2,50   0,10

CPU:


-  General use: 1 – 4 %

-  Worst case: 98%. This is when the problem comes: running massive 
deletes (even on a different machine from the one receiving the deletes) or running 
a repair.

First, the cluster is definitely not overloaded. You are having an issue with 
some nodes from time to time. This looks like an imbalanced cluster. It can be 
due to some wide rows or a bad partition key. Make sure writes are well balanced 
at any time with the partition key you are using, and try to spot warnings 
about large row compactions in the logs. Yet, I don't think this is what you 
face, as you should then have 2 or 3 nodes going crazy at the same time because 
of RF (2 or 3).

Also, can we have an 'iostat -mx 5 100' from when a node goes mad?
Another good troubleshooting tool would be using 
https://github.com/aragozin/jvm-tools/blob/master/sjk-core/COMMANDS.md#ttop-command.
 It would be interesting to see what Cassandra threads are consuming the CPU 
power. This is definitely something I would try on a high load node/time.


About the client, some comments, clearly unrelated to your issue, but probably 
worth mentioning:

.setConsistencyLevel(ConsistencyLevel.QUORUM))
 [...]
 .withRetryPolicy(new 
LoggingRetryPolicy(DowngradingConsistencyRetryPolicy.INSTANCE))

I advise people to never do this. Basically, the QUORUM consistency level means: even in 
the worst case, I want to make sure that at least (RF / 2) + 1 replicas got the read / 
write to consider it valid; if not, drop the operation. If used for both writes 
& reads, this provides you with strong and 'immediate' consistency (no locks 
though, so except for some races). Data will always be sent to all the nodes 
in charge of the token (generally 2 or 3 nodes, depending on RF).

Then you say, if I can't have quorum, then go for one. Meaning you prefer 
availability, rather than consistency. Then, why not use one from the start as 
the consistency level? I would go for CL ONE or remove the 
'DowngradingConsistencyRetryPolicy'.

Also, I would go with 'LOCAL_ONE/QUORUM'; using LOCAL_* is not an issue when 
using only one DC as you do, but it avoids some surprises when adding a new DC. If 
you don't change it, keep it in mind for the day you add a new DC.

Yet, this client does a probably well balanced use of 

Re: Data modelling, including cleanup

2016-04-11 Thread Bo Finnerup Madsen
Hi Hannu,

Thank you for the pointer. We ended up using materialized views in
Cassandra 3.0.3. Seems to do the trick :)
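
For reference, the materialized-view version of the email lookup from the
thread below would look roughly like this in 3.0 (a sketch against the
users_by_username table quoted further down; the view name is made up):

CREATE MATERIALIZED VIEW users_by_email_mv AS
    SELECT email, username, age
    FROM users_by_username
    WHERE email IS NOT NULL AND username IS NOT NULL
    PRIMARY KEY (email, username);

With that in place, a single DELETE against users_by_username is enough;
Cassandra keeps the view in step.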


On Thu, 17 Mar 2016 at 11:16, Hannu Kröger wrote:

> Hi,
>
> That’s how I have done it on many occasions. Nowadays there is the
> possibility to use Cassandra 3.0 and materialised views so that you don’t need
> to keep two tables up to date manually:
> http://www.datastax.com/dev/blog/new-in-cassandra-3-0-materialized-views
>
> Hannu
>
> On 17 Mar 2016, at 12:05, Bo Finnerup Madsen 
> wrote:
>
> Hi,
>
> We are pretty new to data modelling in Cassandra, and are having a bit of
> a challenge creating a model that caters both for queries and updates.
>
> Let me try to explain it using the users example from
> http://www.datastax.com/dev/blog/basic-rules-of-cassandra-data-modeling
>
> They define two tables used for reading users, one by username and one by
> email.
> -
> CREATE TABLE users_by_username (
> username text PRIMARY KEY,
> email text,
> age int
> )
>
> CREATE TABLE users_by_email (
> email text PRIMARY KEY,
> username text,
> age int
> )
> -
>
> Now let's pretend that we need to delete a user, and we are given a 
> username as a key. Would the correct procedure be:
> 1) Read the email from users_by_username using the username as a key
> 2) Delete from users_by_username using the username as a key
> 3) Delete from users_by_email using the email as a key
>
> Or is there a smarter way of doing this?
>
> Yours sincerely,
>   Bo Madsen
>
>
>