Re: Lot of GC on two nodes out of 7

2016-03-01 Thread Jeff Jirsa
Compaction falling behind will likely cause additional work on reads (more 
sstables to merge), but I’d be surprised if it manifested in super long GC. 
When you say twice as many sstables, how many is that?

In cfstats, does anything stand out? Is max row size on those nodes larger than 
on other nodes?
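
If it helps, both can be checked from the command line (a sketch; the keyspace
and table names below are placeholders, and the exact stat labels vary by
Cassandra version):

# run on each node and compare the per-table maximums
nodetool cfstats | grep -i -e 'family' -e 'maximum'
nodetool cfhistograms mykeyspace mytable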

What you don’t show in your JVM options is the new gen size – if you do have 
unusually large partitions on those two nodes (especially likely if you have 
rf=2 – if you have rf=3, then there’s probably a third misbehaving node you 
haven’t found yet), then raising the new gen size can help handle the garbage 
created by reading large partitions without promoting it into the old generation. 
Estimates for the amount of garbage vary, but it could be “gigabytes” of 
garbage on a very wide partition (see 
https://issues.apache.org/jira/browse/CASSANDRA-9754 for work in progress to 
help mitigate that type of pain).
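
For reference, that is the HEAP_NEWSIZE variable in cassandra-env.sh, or the
-Xmn flag directly; the value below is purely illustrative and should be sized
against your heap and GC logs rather than copied:

# illustrative only - tune against your heap size and allocation rate
JVM_OPTS="$JVM_OPTS -Xmn800M"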

- Jeff 

From:  Anishek Agarwal
Reply-To:  "user@cassandra.apache.org"
Date:  Tuesday, March 1, 2016 at 11:12 PM
To:  "user@cassandra.apache.org"
Subject:  Lot of GC on two nodes out of 7

Hello, 

We have a Cassandra cluster of 7 nodes, all with the same JVM GC 
configuration. All our writes and reads use the TokenAware policy wrapping a 
DCAware policy. All nodes are part of the same datacenter.

We are seeing that two nodes have high GC collection times; they mostly seem 
to spend about 300-600 ms per collection. This also seems to result in higher 
CPU utilisation on these machines. The other 5 nodes don't have this 
problem.

There is no additional repair activity going on in the cluster, and we are not 
sure why this is happening. 
We checked cfhistograms on the two CFs we have in the cluster, and the number 
of reads seems to be almost the same. 

We also used cfstats to see the number of sstables on each node, and one of the 
nodes with the above problem has twice the number of sstables of the other 
nodes. This still does not explain why two nodes have high GC overheads. Our GC 
config is as below:
JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"

JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"

JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"

JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"

JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=50"

JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=70"

JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"

JVM_OPTS="$JVM_OPTS -XX:+UseTLAB"

JVM_OPTS="$JVM_OPTS -XX:MaxPermSize=256m"

JVM_OPTS="$JVM_OPTS -XX:+AggressiveOpts"

JVM_OPTS="$JVM_OPTS -XX:+UseCompressedOops"

JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"

JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=48"

JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=48"

JVM_OPTS="$JVM_OPTS -XX:-ExplicitGCInvokesConcurrent"

JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"

JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"

JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs"

# earlier value 131072 = 32768 * 4

JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=131072"

JVM_OPTS="$JVM_OPTS -XX:CMSScheduleRemarkEdenSizeThreshold=104857600"

JVM_OPTS="$JVM_OPTS -XX:CMSRescanMultiple=32768"

JVM_OPTS="$JVM_OPTS -XX:CMSConcMarkMultiple=32768"

#new 

JVM_OPTS="$JVM_OPTS -XX:+CMSConcurrentMTEnabled"


We are using Cassandra 2.0.17. If anyone has any suggestions as to what else 
we can look at to understand why this is happening, please do reply. 



Thanks
anishek







Lot of GC on two nodes out of 7

2016-03-01 Thread Anishek Agarwal
Hello,

We have a Cassandra cluster of 7 nodes, all with the same JVM GC
configuration. All our writes and reads use the TokenAware policy wrapping
a DCAware policy. All nodes are part of the same datacenter.

We are seeing that two nodes have high GC collection times; they mostly
seem to spend about 300-600 ms per collection. This also seems to result in
higher CPU utilisation on these machines. The other 5 nodes don't
have this problem.

There is no additional repair activity going on in the cluster, and we are
not sure why this is happening.
We checked cfhistograms on the two CFs we have in the cluster, and the number
of reads seems to be almost the same.

We also used cfstats to see the number of sstables on each node, and one of
the nodes with the above problem has twice the number of sstables of the
other nodes. This still does not explain why two nodes have high GC
overheads. Our GC config is as below:

JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"

JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"

JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"

JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"

JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=50"

JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=70"

JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"

JVM_OPTS="$JVM_OPTS -XX:+UseTLAB"

JVM_OPTS="$JVM_OPTS -XX:MaxPermSize=256m"

JVM_OPTS="$JVM_OPTS -XX:+AggressiveOpts"

JVM_OPTS="$JVM_OPTS -XX:+UseCompressedOops"

JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"

JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=48"

JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=48"

JVM_OPTS="$JVM_OPTS -XX:-ExplicitGCInvokesConcurrent"

JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"

JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"

JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs"

# earlier value 131072 = 32768 * 4

JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=131072"

JVM_OPTS="$JVM_OPTS -XX:CMSScheduleRemarkEdenSizeThreshold=104857600"

JVM_OPTS="$JVM_OPTS -XX:CMSRescanMultiple=32768"

JVM_OPTS="$JVM_OPTS -XX:CMSConcMarkMultiple=32768"

#new

JVM_OPTS="$JVM_OPTS -XX:+CMSConcurrentMTEnabled"

We are using Cassandra 2.0.17. If anyone has any suggestions as to what
else we can look at to understand why this is happening, please do reply.



Thanks
anishek


Re: List of List

2016-03-01 Thread Sandeep Kalra
Thanks everyone. I am not using Thrift. I am reading up on CQL and learning
to use it.


Best Regards,
Sandeep Kalra


On Tue, Mar 1, 2016 at 9:51 PM, Dani Traphagen 
wrote:

> Hey Sandeep,
>
> It's good to understand why using Thrift isn't a good idea so I'll help
> with that. You'll mostly hear people say RUN AWAY FROM THRIFT WITH THE
> MIGHTY STRIDE OF A GAZELLE. The reason why is that it's old and not
> supported. You'll end up with a broken pile of parts and you definitely
> don't want that. Forming a bad habit while learning with it isn't good
> either.
>
> Thrift was fine for its purpose at the time, but it's convoluted and
> frankly, CQL is so much more mature at this point that you are able to get
> more meaningful information using that interface.
>
> Please don't use Thrift.
>
> Please.
>
> Seriously.
>
> Good luck,
> Dani
>
> On Tue, Mar 1, 2016 at 3:59 PM, Robert Coli  wrote:
>
>> On Tue, Mar 1, 2016 at 3:23 PM, Jonathan Haddad 
>> wrote:
>>
>>> Thrift is deprecated, and will be removed in Cassandra 4.0. Don't do any
>>> new development with it.
>>>
>>
>> +infinity this.
>>
>> =Rob
>>
>>
>
>
>
> --
>
> DANI TRAPHAGEN
>
> Technical Enablement Lead | dani.trapha...@datastax.com
>
> 
>


Re: Commit log size vs memtable total size

2016-03-01 Thread Vlad
Tyler, thanks for explanation!
So commit segment can contain both data from flushed table A and non-flushed 
table B.How is it replayed on start up? Does C* skip portions belonging to 
table A that already were written to SSTable?
Regards, Vlad
 

On Tuesday, March 1, 2016 11:37 PM, Tyler Hobbs  wrote:
 

 
On Tue, Mar 1, 2016 at 6:13 AM, Vlad  wrote:

So the commit log can't keep more than the memtable size; why is there a 
difference between the commit log and memtable sizes?

In order to purge a commitlog segment, all memtables that contain data from 
that segment must be flushed to disk.

Suppose you have two tables:
 - table A has extremely high throughput
 - table B has low throughput

Every commitlog segment will have a mixture of writes for table A and table B.  
The memtable for table A will fill up rapidly and will be flushed frequently.  
The memtable for table B will slowly fill up, and will not be flushed often.  
Since table B's memtable isn't flushed, none of the commit log segments can be 
purged/recycled. Once the commitlog hits its size limit, it will force a flush 
of table B.

This behavior is good, because it allows table B to be flushed in large chunks 
instead of hundreds of tiny sstables.  If the commitlog space were equal to the 
memtable space, Cassandra would have to force a flush of table B's memtable 
approximately every time table A is flushed, despite being much smaller.

To summarize: if you use more than one table, it makes sense to have a larger 
space for commitlog segments.
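
A sketch of the relevant cassandra.yaml knobs (the values are illustrative
only; the memtable key is memtable_total_space_in_mb on 2.0.x and
memtable_heap_space_in_mb on 2.1+):

# cassandra.yaml - illustrative values
commitlog_total_space_in_mb: 8192
memtable_total_space_in_mb: 2048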

-- 
Tyler Hobbs
DataStax


  

Re: List of List

2016-03-01 Thread Dani Traphagen
Hey Sandeep,

It's good to understand why using Thrift isn't a good idea so I'll help
with that. You'll mostly hear people say RUN AWAY FROM THRIFT WITH THE
MIGHTY STRIDE OF A GAZELLE. The reason why is that it's old and not
supported. You'll end up with a broken pile of parts and you definitely
don't want that. Forming a bad habit while learning with it isn't good
either.

Thrift was fine for its purpose at the time, but it's convoluted and
frankly, CQL is so much more mature at this point that you are able to get
more meaningful information using that interface.

Please don't use Thrift.

Please.

Seriously.

Good luck,
Dani

On Tue, Mar 1, 2016 at 3:59 PM, Robert Coli  wrote:

> On Tue, Mar 1, 2016 at 3:23 PM, Jonathan Haddad  wrote:
>
>> Thrift is deprecated, and will be removed in Cassandra 4.0. Don't do any
>> new development with it.
>>
>
> +infinity this.
>
> =Rob
>
>



-- 

DANI TRAPHAGEN

Technical Enablement Lead | dani.trapha...@datastax.com





Re: Practical limit on number of column families

2016-03-01 Thread Jack Krupansky
It is the total table count, across all keyspaces. Memory is memory.

-- Jack Krupansky

On Tue, Mar 1, 2016 at 6:26 PM, Brian Sam-Bodden 
wrote:

> Eric,
>   Is the keyspace as a multitenancy solution as bad as the many tables
> pattern? Is the memory overhead of keyspaces as heavy as that of tables?
>
> Cheers,
> Brian
>
>
> On Tuesday, March 1, 2016, Eric Stevens  wrote:
>
>> It's definitely not true for every use case of a large number of tables,
>> but for many uses where you'd be tempted to do that, adding whatever would
>> have driven your table naming instead as a column in your partition key on
>> a smaller number of tables will meet your needs.  This is especially true
>> if you're looking to solve multi-tenancy, unless you let your tenants
>> dynamically drive your schema (which is a separate can of worms).
>>
>> On Tue, Mar 1, 2016 at 9:08 AM Jack Krupansky 
>> wrote:
>>
>>> I don't think Cassandra was "purposefully developed" for some target
>>> number of tables - there is no evidence of any such explicit intent.
>>> Instead, it would be fair to say that Cassandra was "not purposefully
>>> developed" with a goal of supporting "large numbers of tables." Sometimes
>>> features and capabilities come for free or as a side effect of the
>>> technologies used, but usually specific features and specific capabilities
>>> (such as large numbers of tables) require explicit intent and explicit
>>> effort.
>>>
>>> One could indeed endeavor to design a data store (I'm not even sure it
>>> would still be considered a database per se) that supported either large
>>> numbers of tables or an additional level of storage model in between table
>>> and row (call it "group" maybe or "sub-table".) But obviously Cassandra was
>>> not designed with that goal in mind.
>>>
>>> Traditionally, a "table" is a defined relation over a set of data.
>>> Relation and data are distinct concepts. And a relation name is not simply
>>> a Java-style "object". A relation (table) name is supposed to represent an
>>> abstraction or entity type, while essentially all of the cases I have heard
>>> of for wanting thousands (or even hundreds) of tables are trying to use
>>> table as more of a container for a group of rows for a specific entity
>>> instance rather than a distinct entity type. Granted, Cassandra is not
>>> obligated to be limited to the relational model, but Cassandra, especially
>>> CQL, is intentionally modeled reasonably closely with the relational model
>>> in terms of the data modeling abstractions even though the storage engine
>>> is designed to scale across nodes.
>>>
>>> You could file a Jira requesting such a feature improvement. And then we
>>> would see if sentiment has shifted over the years.
>>>
>>> The key thing is to offer up a use case that warrants support for large
>>> numbers of tables. So far, it has usually been the case that the perceived
>>> need for separate tables could easily be met using clustering columns of a
>>> single table.
>>>
>>> Seriously, if you guys can define a legitimate use case that can't
>>> easily be handled by a single table, that could get the discussion started.
>>>
>>> -- Jack Krupansky
>>>
>>> On Tue, Mar 1, 2016 at 9:11 AM, Fernando Jimenez <
>>> fernando.jime...@wealth-port.com> wrote:
>>>
 Hi Jack

 Being purposefully developed to only handle up to “a few hundred”
 tables is reason enough. I accept that, and likely a use case with many
 tables was never really considered. But I would still like to understand
 the design choices made so perhaps we gain some confidence level in this
 upper limit in the number of tables. The best estimate we have so far is “a
 few hundred” which is a bit vague.

 Regarding scaling, I’m not talking about scaling in terms of data
 volume, but on how the data is structured. One thousand tables with one row
 each is the same data volume as one table with one thousand rows, excluding
 any data structures required to maintain the extra tables. But whereas the
 first seems likely to bring a Cassandra cluster to its knees, the second
 will run happily on a single node cluster in a low end machine.

 We will design our code to use a single table to avoid having
 nightmares with this issue. But if there is any authoritative documentation
 on this characteristic of Cassandra, I would love to know more.

 FJ


 On 01 Mar 2016, at 14:23, Jack Krupansky 
 wrote:

 I don't think there are any "reasons behind it." It is simply empirical
 experience - as reported here.

 Cassandra scales in two dimension - number of rows per node and number
 of nodes. If some source of information lead you to believe otherwise,
 please point out the source so that we can endeavor to correct it.

The exact number of rows per node and tables per node will always have to be 
evaluated empirically - a proof of concept implementation, since it all depends 
on the mix of capabilities of your hardware combined with your specific data 
model, your specific data values, your specific access patterns, and your 
specific load. And it also depends on your own personal tolerance for 
degradation of latency and throughput - some people might find a given set of 
performance metrics acceptable while others might not.

DATA replication from Oracle DB to Cassandra

2016-03-01 Thread anil_ah


Hi,
I want to run a Spark job to do an incremental sync from Oracle to Cassandra; 
the job interval could be one minute. We are looking for near-real-time 
replication with a latency of 1 or 2 minutes.
Please advise on the best approach:
1) Oracle DB -> Spark SQL -> Spark -> Cassandra
2) Oracle DB -> Sqoop -> Cassandra
Which option is better in terms of scalability, incremental loading, etc.?
Regards, Anil


Sent from my Samsung device

Broken links in Apache Cassandra home page

2016-03-01 Thread ANG ANG
Reference:
http://stackoverflow.com/questions/35712166/broken-links-in-apache-cassandra-home-page/35724686#35724686

The following links are broken in the Apache Cassandra Home/Welcome page:

   1. "materialized views":
   http://www.datastax.com/dev/blog/new-in-cassandra-3-0-materialized-views
   2. "#cassandra channel": http://freenode.net/

Is this the right forum to notify the community about this type of issue
(e.g., outdated documentation, broken links)?

Thanks


Fwd: DATA replication from Oracle DB to Cassandra

2016-03-01 Thread anil_ah


 Original message 
From: anil_ah  
Date: 03/02/2016  9:11 am  (GMT+08:00) 
To: User Cassandra  
Subject: DATA replication from Oracle DB to Cassandra 



Hi,
I want to run a Spark job to do an incremental sync from Oracle to Cassandra; 
the job interval could be one minute. We are looking for near-real-time 
replication with a latency of 1 or 2 minutes.
Please advise on the best approach:
1) Oracle DB -> Spark SQL -> Spark -> Cassandra
2) Oracle DB -> Sqoop -> Cassandra
Which option is better in terms of scalability, incremental loading, etc.?
Regards, Anil


Sent from my Samsung device

Re: Querying on index

2016-03-01 Thread Jonathan Haddad
That feels like a serious bug.  Definitely file a JIRA with as many details
as possible.  https://issues.apache.org/jira/browse/CASSANDRA/



On Tue, Mar 1, 2016 at 4:38 PM Rakesh Kumar  wrote:

> Looks like the Bloom filter size was the issue. Once I disabled it, the query
> returns rows correctly, but it was terribly slow (expected, since it will
> hit the sstables every time).
>
>
> -Original Message-
> From: Rakesh Kumar 
> To: user 
> Sent: Tue, Mar 1, 2016 4:57 pm
> Subject: Re: Querying on index
>
>
> At this time no one else is using this table. So the data is static.
>
> -Original Message-
> From: Rakesh Kumar
> To: user
> Sent: Tue, Mar 1, 2016 4:54 pm
> Subject: Querying on index
>
> Cassandra: 3.3
>
> On my test system I create a table:
>
> create table eventinput
> (
>     event_id varchar ,
>     event_class_cd int ,
>     event_ts timestamp ,
>     client_id varchar ,
>     event_message text ,
>     primary key ((client_id,event_id),event_ts)
> )
>
> I created an index on client_id:
> create index idx1 on eventinput(client_id);
>
> When I query:
> select * from eventinput where client_id = 'aa' ALLOW filtering ;
>
> I get random results. One time it is 200, another time 400 or 500 or 600
> and sometimes 0. Why?
>


Re: Querying on index

2016-03-01 Thread Rakesh Kumar
Looks like the Bloom filter size was the issue. Once I disabled it, the query 
returns rows correctly, but it was terribly slow (expected, since it will hit 
the sstables every time).
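
For reference, "disabling" the bloom filter is normally done per table by
setting its false-positive chance to 1.0, which passes every read through to
the sstables:

ALTER TABLE eventinput WITH bloom_filter_fp_chance = 1.0;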



-Original Message-
From: Rakesh Kumar 
To: user 
Sent: Tue, Mar 1, 2016 4:57 pm
Subject: Re: Querying on index




At this time no one else is using this table. So the data is static.

-Original Message-
From: Rakesh Kumar 
To: user 
Sent: Tue, Mar 1, 2016 4:54 pm
Subject: Querying on index

Cassandra: 3.3

On my test system I create a table:

create table eventinput
(
    event_id varchar ,
    event_class_cd int ,
    event_ts timestamp ,
    client_id varchar ,
    event_message  text ,
    primary key ((client_id,event_id),event_ts)
)

I created an index on client_id:
create index idx1 on eventinput(client_id);

When I query:
select * from eventinput where client_id = 'aa' ALLOW filtering ;

I get random results. One time it is 200, another time 400 or 500 or 600 and
sometimes 0. Why?




Re: Snitch for AWS EC2 non-default VPC

2016-03-01 Thread Robert Coli
On Tue, Mar 1, 2016 at 12:12 PM, Arun Sandu  wrote:

> All our nodes are launched in AWS EC2 VPC (private). We have 2
> datacenters(1 us-east , 1- asiapacific) and all communication is through
> private IP's and don't have any public IPs. What is the recommended snitch
> to be used? We currently have GossipingPropertyFileSnitch.
>

I recommend using GPFS unless you're absolutely certain you will never want
to rely on any host but Amazon, and you will never want to (for example)
have an analytics pseudo-datacenter within AWS.

=Rob


Re: List of List

2016-03-01 Thread Robert Coli
On Tue, Mar 1, 2016 at 3:23 PM, Jonathan Haddad  wrote:

> Thrift is deprecated, and will be removed in Cassandra 4.0. Don't do any
> new development with it.
>

+infinity this.

=Rob


Re: Practical limit on number of column families

2016-03-01 Thread Brian Sam-Bodden
Eric,
  Is the keyspace as a multitenancy solution as bad as the many tables
pattern? Is the memory overhead of keyspaces as heavy as that of tables?

Cheers,
Brian

On Tuesday, March 1, 2016, Eric Stevens  wrote:

> It's definitely not true for every use case of a large number of tables,
> but for many uses where you'd be tempted to do that, adding whatever would
> have driven your table naming instead as a column in your partition key on
> a smaller number of tables will meet your needs.  This is especially true
> if you're looking to solve multi-tenancy, unless you let your tenants
> dynamically drive your schema (which is a separate can of worms).
>
> On Tue, Mar 1, 2016 at 9:08 AM Jack Krupansky  > wrote:
>
>> I don't think Cassandra was "purposefully developed" for some target
>> number of tables - there is no evidence of any such explicit intent.
>> Instead, it would be fair to say that Cassandra was "not purposefully
>> developed" with a goal of supporting "large numbers of tables." Sometimes
>> features and capabilities come for free or as a side effect of the
>> technologies used, but usually specific features and specific capabilities
>> (such as large numbers of tables) require explicit intent and explicit
>> effort.
>>
>> One could indeed endeavor to design a data store (I'm not even sure it
>> would still be considered a database per se) that supported either large
>> numbers of tables or an additional level of storage model in between table
>> and row (call it "group" maybe or "sub-table".) But obviously Cassandra was
>> not designed with that goal in mind.
>>
>> Traditionally, a "table" is a defined relation over a set of data.
>> Relation and data are distinct concepts. And a relation name is not simply
>> a Java-style "object". A relation (table) name is supposed to represent an
>> abstraction or entity type, while essentially all of the cases I have heard
>> of for wanting thousands (or even hundreds) of tables are trying to use
>> table as more of a container for a group of rows for a specific entity
>> instance rather than a distinct entity type. Granted, Cassandra is not
>> obligated to be limited to the relational model, but Cassandra, especially
>> CQL, is intentionally modeled reasonably closely with the relational model
>> in terms of the data modeling abstractions even though the storage engine
>> is designed to scale across nodes.
>>
>> You could file a Jira requesting such a feature improvement. And then we
>> would see if sentiment has shifted over the years.
>>
>> The key thing is to offer up a use case that warrants support for large
>> numbers of tables. So far, it has usually been the case that the perceived
>> need for separate tables could easily be met using clustering columns of a
>> single table.
>>
>> Seriously, if you guys can define a legitimate use case that can't easily
>> be handled by a single table, that could get the discussion started.
>>
>> -- Jack Krupansky
>>
>> On Tue, Mar 1, 2016 at 9:11 AM, Fernando Jimenez <
>> fernando.jime...@wealth-port.com
>> >
>> wrote:
>>
>>> Hi Jack
>>>
>>> Being purposefully developed to only handle up to “a few hundred” tables
>>> is reason enough. I accept that, and likely a use case with many tables was
>>> never really considered. But I would still like to understand the design
>>> choices made so perhaps we gain some confidence level in this upper limit
>>> in the number of tables. The best estimate we have so far is “a few
>>> hundred” which is a bit vague.
>>>
>>> Regarding scaling, I’m not talking about scaling in terms of data
>>> volume, but on how the data is structured. One thousand tables with one row
>>> each is the same data volume as one table with one thousand rows, excluding
>>> any data structures required to maintain the extra tables. But whereas the
>>> first seems likely to bring a Cassandra cluster to its knees, the second
>>> will run happily on a single node cluster in a low end machine.
>>>
>>> We will design our code to use a single table to avoid having nightmares
>>> with this issue. But if there is any authoritative documentation on this
>>> characteristic of Cassandra, I would love to know more.
>>>
>>> FJ
>>>
>>>
>>> On 01 Mar 2016, at 14:23, Jack Krupansky >> > wrote:
>>>
>>> I don't think there are any "reasons behind it." It is simply empirical
>>> experience - as reported here.
>>>
>>> Cassandra scales in two dimension - number of rows per node and number
>>> of nodes. If some source of information lead you to believe otherwise,
>>> please point out the source so that we can endeavor to correct it.
>>>
>>> The exact number of rows per node and tables per node will always have
>>> to be evaluated empirically - a proof of concept implementation, since it
>>> all depends on the mix of capabilities of your hardware combined with your
>>> specific data model, your specific data values, your specific access
>>> patterns, and your specific load.

Re: List of List

2016-03-01 Thread Jonathan Haddad
Thrift is deprecated, and will be removed in Cassandra 4.0. Don't do any
new development with it.

What video says to use thrift?

On Tue, Mar 1, 2016 at 2:29 PM Sandeep Kalra 
wrote:

> I am in a very early stage, so I can change. In fact, the videos you
> pointed to also say to do so...
>
>
> Best Regards,
> Sandeep Kalra
>
>
> On Tue, Mar 1, 2016 at 3:58 PM, Jack Krupansky 
> wrote:
>
>> Thrift? Hah! Sorry, I can't help you if you are going that route. I
>> recommend CQL - only.
>>
>> -- Jack Krupansky
>>
>> On Tue, Mar 1, 2016 at 4:47 PM, Sandeep Kalra 
>> wrote:
>>
>>> The way I was planning is to give a RESTful interface to look up the
>>> details of a question, and then the user gets the complete list of answers
>>> and their comments. I am using the Thrift interface and node-js to serve
>>> it. Searches on questions use the subject tag and/or the content.
>>>
>>>
>>>
>>> Best Regards,
>>> Sandeep Kalra
>>>
>>>
>>> On Tue, Mar 1, 2016 at 2:49 PM, Jack Krupansky >> > wrote:
>>>
 Okay, so a very large number of questions, each with a very modest
 number of answers (generally under 5), each with a modest number of
 comments (generally under 5).

 Now we're back to the issue of how you wish to query and access the
 data.

 -- Jack Krupansky

 On Tue, Mar 1, 2016 at 12:39 PM, Sandeep Kalra  wrote:

> ​I do not have limit of number of Answers or its comments.​ Assume it
> to be clone of StackOverflow..
>
>
>
> Best Regards,
> Sandeep Kalra
>
>
> On Tue, Mar 1, 2016 at 11:29 AM, Jack Krupansky <
> jack.krupan...@gmail.com> wrote:
>
>> Clustering columns are your friends.
>>
>> But the first question is how you need to query the data. Queries
>> drive data models in Cassandra.
>>
>> What is the cardinality of this data - how many answers per question
>> and how many comments per answer?
>>
>>
>> -- Jack Krupansky
>>
>> On Tue, Mar 1, 2016 at 12:23 PM, Sandeep Kalra <
>> sandeep.ka...@gmail.com> wrote:
>>
>>> Hi all.
>>>
>>> I am beginner in Cassandra.
>>>
>>> I am working on Q project where I have to maintain a list of list
>>> for objects.
>>>
>>> For e.g. A Question can have list of Answers, and each Answer can
>>> then have list of Comments.
>>>
>>> --
>>> As of now I have 3 tables. Questions, Answers, and Comments. I have
>>> stored UID of Answers in List for question, and then 
>>> each
>>> answer has List in separate table. [Optionally a 
>>> Comment
>>> may have replies]
>>>
>>> I do multiple queries to find the complete answers-list and then its
>>> related comments.
>>>
>>> This whole thing looks inefficient to me.
>>> --
>>>
>>> Question:
>>> *Is there a better way to do it in Cassandra*. What can I do as far
>>> as re-designing database to have lesser queries.
>>>
>>>
>>>
>>> Best Regards,
>>> Sandeep Kalra
>>>
>>>
>>
>

>>>
>>
>


Re: List of List

2016-03-01 Thread Sandeep Kalra
I am in a very early stage, so I can change. In fact, the videos you pointed
to also say to do so...


Best Regards,
Sandeep Kalra


On Tue, Mar 1, 2016 at 3:58 PM, Jack Krupansky 
wrote:

> Thrift? Hah! Sorry, I can't help you if you are going that route. I
> recommend CQL - only.
>
> -- Jack Krupansky
>
> On Tue, Mar 1, 2016 at 4:47 PM, Sandeep Kalra 
> wrote:
>
>> The way I was planning is to give a RESTful interface to look up the
>> details of a question, and then the user gets the complete list of answers
>> and their comments. I am using the Thrift interface and node-js to serve
>> it. Searches on questions use the subject tag and/or the content.
>>
>>
>>
>> Best Regards,
>> Sandeep Kalra
>>
>>
>> On Tue, Mar 1, 2016 at 2:49 PM, Jack Krupansky 
>> wrote:
>>
>>> Okay, so a very large number of questions, each with a very modest
>>> number of answers (generally under 5), each with a modest number of
>>> comments (generally under 5).
>>>
>>> Now we're back to the issue of how you wish to query and access the data.
>>>
>>> -- Jack Krupansky
>>>
>>> On Tue, Mar 1, 2016 at 12:39 PM, Sandeep Kalra 
>>> wrote:
>>>
 ​I do not have limit of number of Answers or its comments.​ Assume it
 to be clone of StackOverflow..



 Best Regards,
 Sandeep Kalra


 On Tue, Mar 1, 2016 at 11:29 AM, Jack Krupansky <
 jack.krupan...@gmail.com> wrote:

> Clustering columns are your friends.
>
> But the first question is how you need to query the data. Queries
> drive data models in Cassandra.
>
> What is the cardinality of this data - how many answers per question
> and how many comments per answer?
>
>
> -- Jack Krupansky
>
> On Tue, Mar 1, 2016 at 12:23 PM, Sandeep Kalra <
> sandeep.ka...@gmail.com> wrote:
>
>> Hi all.
>>
>> I am beginner in Cassandra.
>>
>> I am working on Q project where I have to maintain a list of list
>> for objects.
>>
>> For e.g. A Question can have list of Answers, and each Answer can
>> then have list of Comments.
>>
>> --
>> As of now I have 3 tables. Questions, Answers, and Comments. I have
>> stored UID of Answers in List for question, and then each
>> answer has List in separate table. [Optionally a Comment
>> may have replies]
>>
>> I do multiple queries to find the complete answers-list and then its
>> related comments.
>>
>> This whole thing looks inefficient to me.
>> --
>>
>> Question:
>> *Is there a better way to do it in Cassandra*. What can I do as far
>> as re-designing database to have lesser queries.
>>
>>
>>
>> Best Regards,
>> Sandeep Kalra
>>
>>
>

>>>
>>
>


Re: List of List

2016-03-01 Thread Jack Krupansky
Thrift? Hah! Sorry, I can't help you if you are going that route. I
recommend CQL - only.

-- Jack Krupansky

On Tue, Mar 1, 2016 at 4:47 PM, Sandeep Kalra 
wrote:

> The way I was planning is to give a RESTful interface to look up the details
> of a question, and then the user gets the complete list of answers and their
> comments. I am using the Thrift interface and node-js to serve it. Searches
> on questions use the subject tag and/or the content.
>
>
>
> Best Regards,
> Sandeep Kalra
>
>
> On Tue, Mar 1, 2016 at 2:49 PM, Jack Krupansky 
> wrote:
>
>> Okay, so a very large number of questions, each with a very modest number
>> of answers (generally under 5), each with a modest number of comments
>> (generally under 5).
>>
>> Now we're back to the issue of how you wish to query and access the data.
>>
>> -- Jack Krupansky
>>
>> On Tue, Mar 1, 2016 at 12:39 PM, Sandeep Kalra 
>> wrote:
>>
>>> ​I do not have limit of number of Answers or its comments.​ Assume it to
>>> be clone of StackOverflow..
>>>
>>>
>>>
>>> Best Regards,
>>> Sandeep Kalra
>>>
>>>
>>> On Tue, Mar 1, 2016 at 11:29 AM, Jack Krupansky <
>>> jack.krupan...@gmail.com> wrote:
>>>
 Clustering columns are your friends.

 But the first question is how you need to query the data. Queries drive
 data models in Cassandra.

 What is the cardinality of this data - how many answers per question
 and how many comments per answer?


 -- Jack Krupansky

 On Tue, Mar 1, 2016 at 12:23 PM, Sandeep Kalra  wrote:

> Hi all.
>
> I am beginner in Cassandra.
>
> I am working on Q project where I have to maintain a list of list
> for objects.
>
> For e.g. A Question can have list of Answers, and each Answer can then
> have list of Comments.
>
> --
> As of now I have 3 tables. Questions, Answers, and Comments. I have
> stored UID of Answers in List for question, and then each
> answer has List in separate table. [Optionally a Comment
> may have replies]
>
> I do multiple queries to find the complete answers-list and then its
> related comments.
>
> This whole thing looks inefficient to me.
> --
>
> Question:
> *Is there a better way to do it in Cassandra*. What can I do as far
> as re-designing database to have lesser queries.
>
>
>
> Best Regards,
> Sandeep Kalra
>
>

>>>
>>
>


Re: Commit log size vs memtable total size

2016-03-01 Thread Jack Krupansky
It would be nice to get this info into the doc or at least a blog post.

-- Jack Krupansky

On Tue, Mar 1, 2016 at 4:37 PM, Tyler Hobbs  wrote:

>
> On Tue, Mar 1, 2016 at 6:13 AM, Vlad  wrote:
>
>> So the commit log can't keep more than the memtable size; why is there a
>> difference between the commit log and memtable sizes?
>
>
> In order to purge a commitlog segment, *all* memtables that contain data
> from that segment must be flushed to disk.
>
> Suppose you have two tables:
>  - table A has extremely high throughput
>  - table B has low throughput
>
> Every commitlog segment will have a mixture of writes for table A and
> table B.  The memtable for table A will fill up rapidly and will be flushed
> frequently.  The memtable for table B will slowly fill up, and will not be
> flushed often.  Since table B's memtable isn't flushed, none of the commit
> log segments can be purged/recycled.  Once the commitlog hits its size limit,
> it will force a flush of table B.
>
> This behavior is good, because it allows table B to be flushed in large
> chunks instead of hundreds of tiny sstables.  If the commitlog space were
> equal to the memtable space, Cassandra would have to force a flush of table
> B's memtable approximately every time table A is flushed, despite being
> much smaller.
>
> To summarize: if you use more than one table, it makes sense to have a
> larger space for commitlog segments.
>
> --
> Tyler Hobbs
> DataStax 
>


Re: Querying on index

2016-03-01 Thread Rakesh Kumar


At this time no one else is using this table. So the data is static.

-Original Message-
From: Rakesh Kumar 
To: user 
Sent: Tue, Mar 1, 2016 4:54 pm
Subject: Querying on index

Cassandra: 3.3

On my test system I create a table:

create table eventinput
(
    event_id varchar ,
    event_class_cd int ,
    event_ts timestamp ,
    client_id varchar ,
    event_message  text ,
    primary key ((client_id,event_id),event_ts)
)

I created an index on client_id:
create index idx1 on eventinput(client_id);

When I query:
select * from eventinput where client_id = 'aa' ALLOW filtering ;

I get random results. One time it is 200, another time 400 or 500 or 600 and
sometimes 0. Why?



Querying on index

2016-03-01 Thread Rakesh Kumar
Cassandra: 3.3

On my test system I create a table

create table eventinput
(
event_id varchar ,
event_class_cd int ,
event_ts timestamp ,
client_id varchar ,
event_message  text ,
primary key ((client_id,event_id),event_ts)
) 

I created an index on client_id 
create index idx1 on eventinput(client_id);

When I query 
select *
from eventinput
where client_id = 'aa' 
ALLOW filtering ;

I get random results. One time it is 200, another time 400 or 500 or 600 and 
sometimes 0.

Why ?





Re: Isolation for atomic batch on the same partition key

2016-03-01 Thread Tyler Hobbs
On Mon, Feb 22, 2016 at 3:58 PM, Yawei Li  wrote:

>
> 1. If  an atomic batch (logged batch) contains a bunch of row mutations
> and all of them have the same partition key, can I assume all those changes
> have the same isolation as the row-level isolation? According to the post
> here http://www.mail-archive.com/user%40cassandra.apache.org/msg42434.html,
> it seems that we can get strong isolation.
> e.g.
> *BEGIN BATCH*
> *  UPDATE a IF condition_1;*
> *  INSERT b;*
> *  INSERT c;*
> *APPLY BATCH*
>
> So at any replica, we expect isolation for the three changes on *a*, *b*,
> *c*  (*a* , *b*, *c* have the same partition key *k1*) -- i.e. either
> none or all of them are visible. Can someone help confirm?
>

That is correct.


>
> 2. Say in the above batch, we include two extra row mutations d and e for
> another partition key *k2*.  Will the changes on (*a*, *b*, *c*)  and (*d*
> , *e*) still atomic respectively in terms of isolation? I understand
> there is no isolation between (*a*, *b*, *c*) and (*d*, *e*).  I.e. is
> there a per-parition-key isolation guaranteed?
>

You can't use LWT conditions (i.e. "IF condition_1") in batches that span
multiple partitions keys.  If you did not include the condition, then you
would get per-partition isolation, as you describe.
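
A condition-free sketch of that second case (table and values hypothetical):
the batch as a whole is atomic, and the two k1 rows become visible together,
but there is no isolation between the k1 and k2 partitions:

BEGIN BATCH
  INSERT INTO t (k, c, v) VALUES ('k1', 1, 'a');
  INSERT INTO t (k, c, v) VALUES ('k1', 2, 'b');
  INSERT INTO t (k, c, v) VALUES ('k2', 1, 'c');
APPLY BATCH;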


>
>
> 3. I assume CL SERIAL or LOCAL_SERIAL on reads will try applying the above
> logged batch if it is committed but not applied. Right?
>

Correct.

-- 
Tyler Hobbs
DataStax 


Re: IF NOT EXISTS with multiple static columns confusion

2016-03-01 Thread Tyler Hobbs
What version of Cassandra are you using?  I just tested this out against
trunk and got reasonable behavior:


cqlsh:ks1> CREATE TABLE test (k int, s1 int static, s2 int static, c int, v
int, PRIMARY KEY (k, c));
cqlsh:ks1> INSERT INTO test (k, c, v) VALUES (0, 0, 0);
cqlsh:ks1> UPDATE test SET s1 = 0 WHERE k = 0 IF s1 = null;

 [applied]
-----------
  True

cqlsh:ks1> TRUNCATE test;
cqlsh:ks1> INSERT INTO test (k, c, v) VALUES (0, 0, 0);
cqlsh:ks1> INSERT INTO test (k, s1) VALUES (0, 0) IF NOT EXISTS;

 [applied]
-----------
  True



On Tue, Feb 23, 2016 at 6:15 PM, Nimi Wariboko Jr 
wrote:

> I have a table with 2 static columns, and I write to either one of them,
> if I then write to the other one using IF NOT EXISTS, it fails even though
> it has never been written to before. Is it the case that all static
> columns share the same "written to" marker?
>
> Given a table like so:
>
> CREATE TABLE test (
>   id timeuuid,
>   foo int static,
>   bar int static,
>   baz int,
>   baq int,
>   PRIMARY KEY (id, baz)
> )
>
> I'm seeing some confusing behavior see the statements below -
>
> """
> INSERT INTO cmpayments.report_payments (id, foo) VALUES (NOW(), 1) IF NOT
> EXISTS; // succeeds
> TRUNCATE test;
> INSERT INTO cmpayments.report_payments (id, baq) VALUES
> (99c3-b01a-11e5-b170-0242ac110002, 1);
> UPDATE cmpayments.report_payments SET foo = 1 WHERE
> id=99c3-b01a-11e5-b170-0242ac110002 IF foo=null; // fails, even though
> foo=null
> TRUNCATE test;
> INSERT INTO cmpayments.report_payments (id, bar) VALUES
> (99c3-b01a-11e5-b170-0242ac110002, 1); // succeeds
> INSERT INTO cmpayments.report_payments (id, foo) VALUES (NOW(), 1) IF NOT
> EXISTS; // fails, even though foo=null, and has never been written to
> """
>
> Nimi
>



-- 
Tyler Hobbs
DataStax 


Re: List of List

2016-03-01 Thread Sandeep Kalra
The way I was planning is to give a RESTful interface to look up the details of
a question, and then the user gets the complete list of answers and their
comments. I am using the Thrift interface and node-js to serve it. Searches on
questions use the subject tag and/or the content.



Best Regards,
Sandeep Kalra


On Tue, Mar 1, 2016 at 2:49 PM, Jack Krupansky 
wrote:

> Okay, so a very large number of questions, each with a very modest number
> of answers (generally under 5), each with a modest number of comments
> (generally under 5).
>
> Now we're back to the issue of how you wish to query and access the data.
>
> -- Jack Krupansky
>
> On Tue, Mar 1, 2016 at 12:39 PM, Sandeep Kalra 
> wrote:
>
>> ​I do not have limit of number of Answers or its comments.​ Assume it to
>> be clone of StackOverflow..
>>
>>
>>
>> Best Regards,
>> Sandeep Kalra
>>
>>
>> On Tue, Mar 1, 2016 at 11:29 AM, Jack Krupansky > > wrote:
>>
>>> Clustering columns are your friends.
>>>
>>> But the first question is how you need to query the data. Queries drive
>>> data models in Cassandra.
>>>
>>> What is the cardinality of this data - how many answers per question and
>>> how many comments per answer?
>>>
>>>
>>> -- Jack Krupansky
>>>
>>> On Tue, Mar 1, 2016 at 12:23 PM, Sandeep Kalra 
>>> wrote:
>>>
 Hi all.

 I am beginner in Cassandra.

 I am working on Q project where I have to maintain a list of list for
 objects.

 For e.g. A Question can have list of Answers, and each Answer can then
 have list of Comments.

 --
 As of now I have 3 tables. Questions, Answers, and Comments. I have
 stored UID of Answers in List for question, and then each
 answer has List in separate table. [Optionally a Comment
 may have replies]

 I do multiple queries to find the complete answers-list and then its
 related comments.

 This whole thing looks inefficient to me.
 --

 Question:
 *Is there a better way to do it in Cassandra*. What can I do as far as
 re-designing database to have lesser queries.



 Best Regards,
 Sandeep Kalra


>>>
>>
>


Re: List of List

2016-03-01 Thread Jack Krupansky
Okay, so a very large number of questions, each with a very modest number
of answers (generally under 5), each with a modest number of comments
(generally under 5).

Now we're back to the issue of how you wish to query and access the data.

-- Jack Krupansky

On Tue, Mar 1, 2016 at 12:39 PM, Sandeep Kalra 
wrote:

> ​I do not have limit of number of Answers or its comments.​ Assume it to
> be clone of StackOverflow..
>
>
>
> Best Regards,
> Sandeep Kalra
>
>
> On Tue, Mar 1, 2016 at 11:29 AM, Jack Krupansky 
> wrote:
>
>> Clustering columns are your friends.
>>
>> But the first question is how you need to query the data. Queries drive
>> data models in Cassandra.
>>
>> What is the cardinality of this data - how many answers per question and
>> how many comments per answer?
>>
>>
>> -- Jack Krupansky
>>
>> On Tue, Mar 1, 2016 at 12:23 PM, Sandeep Kalra 
>> wrote:
>>
>>> Hi all.
>>>
>>> I am beginner in Cassandra.
>>>
>>> I am working on Q project where I have to maintain a list of list for
>>> objects.
>>>
>>> For e.g. A Question can have list of Answers, and each Answer can then
>>> have list of Comments.
>>>
>>> --
>>> As of now I have 3 tables. Questions, Answers, and Comments. I have
>>> stored UID of Answers in List for question, and then each
>>> answer has List in separate table. [Optionally a Comment
>>> may have replies]
>>>
>>> I do multiple queries to find the complete answers-list and then its
>>> related comments.
>>>
>>> This whole thing looks inefficient to me.
>>> --
>>>
>>> Question:
>>> *Is there a better way to do it in Cassandra*. What can I do as far as
>>> re-designing database to have lesser queries.
>>>
>>>
>>>
>>> Best Regards,
>>> Sandeep Kalra
>>>
>>>
>>
>


Re: Snitch for AWS EC2 non-default VPC

2016-03-01 Thread Asher Newcomer
Hi Arun,

This distinction has been a can of worms for me also - and I'm not sure my
understanding is entirely correct.

I use GossipingPropertyFileSnitch for my multi-region setup, which seems to
be more flexible than the Ec2 snitches. The Ec2 snitches should work also,
but their behavior is more opaque from my perspective.

AFAIK - if all of your nodes can reach each other via private IP, and your
anticipated clients can reach all nodes via their private IP, then using
the private IP address as the broadcast_address is fine.

If there will ever be a situation where a client or node will need to reach
some part of the cluster using public IPs, then public IPs should be used
as the broadcast_address.

A simple flow-chart / diagram of how these various settings are used by
Cassandra would be very helpful for people new to the project.

Regards,

Asher

On Tue, Mar 1, 2016 at 3:12 PM, Arun Sandu  wrote:

> Hi all,
>
> All our nodes are launched in AWS EC2 VPC (private). We have 2
> datacenters(1 us-east , 1- asiapacific) and all communication is through
> private IP's and don't have any public IPs. What is the recommended snitch
> to be used? We currently have GossipingPropertyFileSnitch.
>
> 1. If Ec2MultiRegionSnitch, then what would be the broadcast_address?
> 2. If not Ec2MultiRegionSnitch, which snitch better fits this environment?
>
> *Ref:*
> As per the document Ec2MultiRegionSnitch, set the listen_address to the
> *private* IP address of the node, and the broadcast_address to the *public*
> IP address of the node.
>
> --
> Thanks
> Arun
>


RE: Snitch for AWS EC2 non-default VPC

2016-03-01 Thread Jun Wu
I've worked on some experiments with AWS EC2. According to the doc you provided 
and from my own experience, Ec2MultiRegionSnitch should be the right setting, 
since you have 2 different datacenters.
In cassandra.yaml: change the seeds to the public address list, change the 
listen and rpc addresses to the private address, change the broadcast address 
to the public address, and use Ec2MultiRegionSnitch.
Hope it works!
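
A sketch of those settings in cassandra.yaml (every address below is a
placeholder):

# cassandra.yaml - placeholder addresses
endpoint_snitch: Ec2MultiRegionSnitch
listen_address: 10.0.1.5        # private IP of this node
rpc_address: 10.0.1.5           # private IP of this node
broadcast_address: 54.10.20.30  # public IP of this node
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "54.10.20.30,54.40.50.60"  # public IPs of the seed nodes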

Date: Tue, 1 Mar 2016 15:12:04 -0500
Subject: Snitch for AWS EC2 non-default VPC
From: arunsandu...@gmail.com
To: user@cassandra.apache.org

Hi all,

All our nodes are launched in AWS EC2 VPC (private). We have 2 datacenters(1 
us-east , 1- asiapacific) and all communication is through private IP's and 
don't have any public IPs. What is the recommended snitch to be used? We 
currently have GossipingPropertyFileSnitch.

1. If Ec2MultiRegionSnitch, then what would be the broadcast_address?
2. If not Ec2MultiRegionSnitch, which snitch better fits this environment?

Ref:
As per the document Ec2MultiRegionSnitch, set the listen_address to the private 
IP address of the node, and the broadcast_address to the public IP address of 
the node.
-- 
Thanks
Arun

  

Re: Cassandra Usages

2016-03-01 Thread Andrés Ivaldi
Hello Jack
What do you mean by "the map datatype with string key values effectively
gives you extensible columns"?

Regards

On Tue, Mar 1, 2016 at 1:34 PM, Jack Krupansky 
wrote:

> OLAP using Cassandra and Spark:
>
> http://www.slideshare.net/EvanChan2/breakthrough-olap-performance-with-cassandra-and-spark
>
> What is the cardinality of your cube dimensions? Obviously any
> multi-dimensional data must be flattened.
>
> Cassandra tables have fixed named columns, but... the map datatype with
> string key values effectively gives you extensible columns.
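
For illustration, a minimal CQL sketch of that idea (table and column names
are hypothetical): a map<text, text> column lets each row carry its own ad-hoc
set of named values without altering the schema:

CREATE TABLE facts (
    row_id uuid PRIMARY KEY,
    fixed_measure int,
    attributes map<text, text>  -- keys act like dynamic column names
);

UPDATE facts SET attributes['continent'] = 'Europe'
  WHERE row_id = 123e4567-e89b-12d3-a456-426655440000;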
>
>
>
> -- Jack Krupansky
>
> On Tue, Mar 1, 2016 at 11:22 AM, Andrés Ivaldi  wrote:
>
>> Jonathan thanks for the link,
>> I believe it may be good for the data store part, because it is fast for I/O
>> and handles time series; analytics could be done with Apache Ignite and/or
>> Apache Spark.
>> What worries me is that it looks very complex to create the structure for
>> each fact table and then extend it.
>>
>> regards.
>>
>> On Sun, Feb 28, 2016 at 12:28 PM, Jonathan Haddad 
>> wrote:
>>
>>> Cassandra is primarily used as an OLTP database, not analytics. You
>>> should watch this 30 min video discussing Cassandra core concepts (coming
>>> from a relational background):
>>> https://academy.datastax.com/courses/ds101-introduction-cassandra
>>>
>>> On Sun, Feb 28, 2016 at 5:40 AM Andrés Ivaldi 
>>> wrote:
>>>
 Hello, at my work we are looking for new technologies for an analysis
 engine, and we are evaluating different technologies; one of them is
 Cassandra as our data repository.

 Now we can execute query analysis against an OLAP cube and an RDBMS, using
 MSSQL as our data repository. The cube is obsolete and the SQL Server engine
 is slow as a data repository.

 I don't know much about Cassandra; I have read some books, and it looks to
 fit what we need well, but there are some things that look like a
 problem for us.

 Our engine is designed to be scalable, flexible and dynamic, any user
 can add new dimensions or measures from any source, all the data is stored
 on Cube(this is fixed data) and MSSQL(dynamic data) so we have decoupled
 tables with the dimension values.


 Ok, with the context given, I'd like to clear up some doubts:

 - Am I able to flatten the table with all the possible dimension values into
 Cassandra, creating the PK from the dimension columns? Would this give me
 the "sensation" of a data pivot over the PK columns? If so, what if I
 want to change the order of the columns, or add or remove some?
 - Is it possible to extend the values of a row dynamically? What we often do
 is join a row against a mapped external data value to extend the
 hierarchical dimension value structure (i.e. state->Country->Continent).

 I know we can do some of these things in the core of our engine, like
 extending the dimension values or reducing columns, but as we are
 evaluating different technologies, it is good to know.

 Regards!!


 --
 Ing. Ivaldi Andres

>>>
>>
>>
>> --
>> Ing. Ivaldi Andres
>>
>
>


-- 
Ing. Ivaldi Andres


Re: Cassandra Usages

2016-03-01 Thread Andrés Ivaldi
Thanks all for the tips,
Mainly we are replacing an OLAP cube, but our engine works fine directly
against an RDBMS, so with the low latency of Cassandra it could work nicely
(the extensibility of this is what worries me).
We will give Cassandra + Spark a try.

Thanks again!!

On Tue, Mar 1, 2016 at 2:59 PM, Jack Krupansky 
wrote:

> I would spin it as Cassandra being the right choice where your primary
> need in OLTP and with a secondary need for analytics. IOW, where you would
> otherwise need to use two separate databases for the same data.
>
>
> -- Jack Krupansky
>
> On Tue, Mar 1, 2016 at 12:40 PM, Jonathan Haddad 
> wrote:
>
>> Spark & Cassandra work just fine together, but, as I said, Cassandra is
>> *primarily* used for OLTP.  If your main use case is analytics, I would use
>> something that's built for analytics.  If 90%+ of your queries are going to
>> be 1-10ms & customer facing, then you're good to go.  If you're building
>> something to replace OLAP cubes, I'd look at something else.
>>
>> On Tue, Mar 1, 2016 at 8:52 AM Jack Krupansky 
>> wrote:
>>
>>> OLAP using Cassandra and Spark:
>>>
>>> http://www.slideshare.net/EvanChan2/breakthrough-olap-performance-with-cassandra-and-spark
>>>
>>> What is the cardinality of your cube dimensions? Obviously any
>>> multi-dimensional data must be flattened.
>>>
>>> Cassandra tables have fixed named columns, but... the map datatype with
>>> string key values effectively gives you extensible columns.
>>>
>>>
>>>
>>> -- Jack Krupansky
>>>
>>> On Tue, Mar 1, 2016 at 11:22 AM, Andrés Ivaldi 
>>> wrote:
>>>
 Jonathan thanks for the link,
 I believe it may be good for the data store part, because it is fast for
 I/O and handles time series; analytics could be done with Apache Ignite
 and/or Apache Spark.
 What worries me is that it looks very complex to create the structure for
 each fact table and then extend it.

 regards.

 On Sun, Feb 28, 2016 at 12:28 PM, Jonathan Haddad 
 wrote:

> Cassandra is primarily used as an OLTP database, not analytics. You
> should watch this 30 min video discussing Cassandra core concepts (coming
> from a relational background):
> https://academy.datastax.com/courses/ds101-introduction-cassandra
>
> On Sun, Feb 28, 2016 at 5:40 AM Andrés Ivaldi 
> wrote:
>
>> Hello, At my work we are looking for new technologies for an Analysis
>> Engine, and we are evaluating different technologies; one of them is
>> Cassandra as our Data repository.
>>
>> Now we can execute query analysis against an OLAP Cube and RDBMS,
>> using MSSQL as our data repository. Cube is obsolete and SQL server 
>> engine
>> is slow as data repository.
>>
>> I don't know much about cassandra, I read some books, and looks to
>> fit well on what we are needing, but there are some things that looks 
>> like
>> a problem for us.
>>
>> Our engine is designed to be scalable, flexible and dynamic, any user
>> can add new dimensions or measures from any source, all the data is 
>> stored
>> on Cube(this is fixed data) and MSSQL(dynamic data) so we have decoupled
>> tables with the dimension values.
>>
>>
>> Ok, with the context given I'll like to clear some doubts
>>
>> - I able to flat the table with all the possible dimension values to
>> cassandra, creating the pk against the dimension columns? this will give 
>> me
>> the "sensation" of data pivot over the PK columns? If correct, what if I
>> want to select the order of the columns, or add another or reduce them?
>> - It's possible to extend the values of a row dynamically? What we do
>> often is join row against a value of a mapped external data value to 
>> extend
>> the dimensions hierarchical value structure (ie 
>> state->Country->Continent)
>>
>> I know we can do some of this things in the core of our engine, like
>> the dimension extension of the values or reduce columns, but as we are
>> evaluating differents technologies is good to know.
>>
>> Regards!!
>>
>>
>> --
>> Ing. Ivaldi Andres
>>
>


 --
 Ing. Ivaldi Andres

>>>
>>>
>


-- 
Ing. Ivaldi Andres


Snitch for AWS EC2 non-default VPC

2016-03-01 Thread Arun Sandu
Hi all,

All our nodes are launched in AWS EC2 VPC (private). We have 2
datacenters(1 us-east , 1- asiapacific) and all communication is through
private IP's and don't have any public IPs. What is the recommended snitch
to be used? We currently have GossipingPropertyFileSnitch.

1. If Ec2MultiRegionSnitch, then what would be the broadcast_address?
2. If not Ec2MultiRegionSnitch, which snitch better fits this environment?

*Ref:*
As per the document Ec2MultiRegionSnitch, set the listen_address to the
*private* IP address of the node, and the broadcast_address to the *public*
IP address of the node.

-- 
Thanks
Arun


Re: Practical limit on number of column families

2016-03-01 Thread Eric Stevens
It's definitely not true for every use case of a large number of tables,
but for many uses where you'd be tempted to do that, adding whatever would
have driven your table naming instead as a column in your partition key on
a smaller number of tables will meet your needs.  This is especially true
if you're looking to solve multi-tenancy, unless you let your tenants
dynamically drive your schema (which is a separate can of worms).
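
A minimal sketch of that pattern (all names hypothetical): rather than one
table per tenant, the tenant becomes part of the partition key of a shared
table, so each tenant's data still lives in its own partitions:

CREATE TABLE events_by_tenant (
    tenant_id text,
    event_id timeuuid,
    payload text,
    PRIMARY KEY ((tenant_id), event_id)
);

SELECT * FROM events_by_tenant WHERE tenant_id = 'acme';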

On Tue, Mar 1, 2016 at 9:08 AM Jack Krupansky 
wrote:

> I don't think Cassandra was "purposefully developed" for some target
> number of tables - there is no evidence of any such explicit intent.
> Instead, it would be fair to say that Cassandra was "not purposefully
> developed" with a goal of supporting "large numbers of tables." Sometimes
> features and capabilities come for free or as a side effect of the
> technologies used, but usually specific features and specific capabilities
> (such as large numbers of tables) require explicit intent and explicit
> effort.
>
> One could indeed endeavor to design a data store (I'm not even sure it
> would still be considered a database per se) that supported either large
> numbers of tables or an additional level of storage model in between table
> and row (call it "group" maybe or "sub-table".) But obviously Cassandra was
> not designed with that goal in mind.
>
> Traditionally, a "table" is a defined relation over a set of data.
> Relation and data are distinct concepts. And a relation name is not simply
> a Java-style "object". A relation (table) name is supposed to represent an
> abstraction or entity type, while essentially all of the cases I have heard
> of for wanting thousands (or even hundreds) of tables are trying to use
> table as more of a container for a group of rows for a specific entity
> instance rather than a distinct entity type. Granted, Cassandra is not
> obligated to be limited to the relational model, but Cassandra, especially
> CQL, is intentionally modeled reasonably closely with the relational model
> in terms of the data modeling abstractions even though the storage engine
> is designed to scale across nodes.
>
> You could file a Jira requesting such a feature improvement. And then we
> would see if sentiment has shifted over the years.
>
> The key thing is to offer up a use case that warrants support for large
> numbers of tables. So far, it has usually been the case that the perceived
> need for separate tables could easily be met using clustering columns of a
> single table.
>
> Seriously, if you guys can define a legitimate use case that can't easily
> be handled by a single table, that could get the discussion started.
>
> -- Jack Krupansky
>
> On Tue, Mar 1, 2016 at 9:11 AM, Fernando Jimenez <
> fernando.jime...@wealth-port.com> wrote:
>
>> Hi Jack
>>
>> Being purposefully developed to only handle up to “a few hundred” tables
>> is reason enough. I accept that, and likely a use case with many tables was
>> never really considered. But I would still like to understand the design
>> choices made so perhaps we gain some confidence level in this upper limit
>> in the number of tables. The best estimate we have so far is “a few
>> hundred” which is a bit vague.
>>
>> Regarding scaling, I’m not talking about scaling in terms of data volume,
>> but on how the data is structured. One thousand tables with one row each is
>> the same data volume as one table with one thousand rows, excluding any
>> data structures required to maintain the extra tables. But whereas the
>> first seems likely to bring a Cassandra cluster to its knees, the second
>> will run happily on a single node cluster in a low end machine.
>>
>> We will design our code to use a single table to avoid having nightmares
>> with this issue. But if there is any authoritative documentation on this
>> characteristic of Cassandra, I would love to know more.
>>
>> FJ
>>
>>
>> On 01 Mar 2016, at 14:23, Jack Krupansky 
>> wrote:
>>
>> I don't think there are any "reasons behind it." It is simply empirical
>> experience - as reported here.
>>
>> Cassandra scales in two dimension - number of rows per node and number of
>> nodes. If some source of information lead you to believe otherwise, please
>> point out the source so that we can endeavor to correct it.
>>
>> The exact number of rows per node and tables per node will always have to
>> be evaluated empirically - a proof of concept implementation, since it all
>> depends on the mix of capabilities of your hardware combined with your
>> specific data model, your specific data values, your specific access
>> patterns, and your specific load. And it also depends on your own personal
>> tolerance for degradation of latency and throughput - some people might
>> find a given set of performance metrics acceptable while others might not.
>>
>> -- Jack Krupansky
>>
>> On Tue, Mar 1, 2016 at 3:54 AM, Fernando Jimenez <
>> fernando.jime...@wealth-port.com> wrote:

Re: List of List

2016-03-01 Thread Sandeep Kalra
Thanks a lot.

I have started with the videos too. I will get back if I see any problem.


Best Regards,
Sandeep Kalra


On Tue, Mar 1, 2016 at 12:36 PM, Jonathan Haddad  wrote:

> I'd do something like this:
>
> CREATE TABLE questions (
> question_id timeuuid primary key,
> question text
> );
>
> CREATE TABLE answers (
> question_id timeuuid,
> answer_id timeuuid,
> answer text,
> primary key(question_id, answer_id)
> );
>
> CREATE TABLE comments (
> answer_id timeuuid,
> comment_id timeuuid,
> comment text,
> primary key(answer_id, comment_id)
> );
>
> You can select all the answers for a given question (ordered by the time
> they appeared, yay) with :
> SELECT * from answers where question_id = ?
>
> Same applies to comments.
>
> If you want to do categories as well, you'd want to modify the question
> table to have category_id as the partition key.  Again, I suggest you watch
> the videos in Datastax Academy and not try to shortcut your data modeling
> knowledge as it's really, really important and screwing it up will cost you
> 100x the time as well as about a million headaches.
>
> On Tue, Mar 1, 2016 at 9:59 AM Sandeep Kalra 
> wrote:
>
>> I do not have a limit on the number of Answers or their comments. Assume it to
>> be a clone of StackOverflow.
>>
>>
>>
>> Best Regards,
>> Sandeep Kalra
>>
>>
>> On Tue, Mar 1, 2016 at 11:29 AM, Jack Krupansky > > wrote:
>>
>>> Clustering columns are your friends.
>>>
>>> But the first question is how you need to query the data. Queries drive
>>> data models in Cassandra.
>>>
>>> What is the cardinality of this data - how many answers per question and
>>> how many comments per answer?
>>>
>>>
>>> -- Jack Krupansky
>>>
>>> On Tue, Mar 1, 2016 at 12:23 PM, Sandeep Kalra 
>>> wrote:
>>>
 Hi all.

 I am beginner in Cassandra.

 I am working on Q project where I have to maintain a list of list for
 objects.

 For e.g. A Question can have list of Answers, and each Answer can then
 have list of Comments.

 --
 As of now I have 3 tables. Questions, Answers, and Comments. I have
 stored UID of Answers in List for question, and then each
 answer has List in separate table. [Optionally a Comment
 may have replies]

 I do multiple queries to find the complete answers-list and then its
 related comments.

 This whole thing looks inefficient to me.
 --

 Question:
 *Is there a better way to do it in Cassandra*. What can I do as far as
 re-designing database to have lesser queries.



 Best Regards,
 Sandeep Kalra


>>>
>>


Re: Checking replication status

2016-03-01 Thread Bryan Cheng
Hi Jeremy,

For more insight into the hint system, these two blog posts are great
resources: http://www.datastax.com/dev/blog/modern-hinted-handoff, and
http://www.datastax.com/dev/blog/whats-coming-to-cassandra-in-3-0-improved-hint-storage-and-delivery
.

For timeframes, that's going to differ based on your read/write patterns
and load. Although I haven't tried this before, I believe you can query the
system.hints table to see the status of hints queued by the local machine.
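If you want to inspect that directly, a rough sketch (untested; this applies
to 2.x, since 3.0 moves hints out of this table into flat files, per the
second blog post above):

-- on the coordinator that stored the hints (Cassandra 2.x)
SELECT count(*) FROM system.hints;

-- or see which target peers still have hints queued
SELECT target_id FROM system.hints LIMIT 100;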

--local and --dc are similar in the sense that they are always repairs
against the local datacenter, they just differ in syntax. If you sustain
loss of inter-dc connectivity for longer than max_hint_window_in_ms, you'll
want to run a cross-dc repair, which is just the standard full repair
(without specifying either).
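As a sketch of the syntax (flag spellings vary slightly by version; check
nodetool repair --help on your build):

# repair using only replicas in the local datacenter
nodetool repair -local

# full repair across all datacenters, e.g. after an outage longer than the hint window
nodetool repair

And for reference, the hint window itself is a cassandra.yaml setting; the
value below is the usual default:

# how long a coordinator keeps hints for an unreachable peer
max_hint_window_in_ms: 10800000   # 3 hours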

On Mon, Feb 29, 2016 at 7:38 PM, Jimmy Lin  wrote:

> hi Bryan,
> I guess I want to find out if there is any way to tell when data will
> become consistent again in both cases.
>
> if the node has been down for less than the max_hint_window (say 2 hours out of
> a 3 hr max), is there any way to check the log or JMX etc. to see if the hint
> queue size is back to zero or a lower range?
>
>
> if a node goes down for longer than the max_hint_window time (say 4 hrs > our
> max 3 hrs), we run a repair job. What is the correct nodetool repair job
> syntax to use?
> in particular, what is the difference between -local vs -dc? they both
> seem to indicate repairing nodes within a datacenter, but for a cross-DC
> network outage, we want to repair nodes across DCs, right?
>
> thanks
>
>
>
> On Fri, Feb 26, 2016 at 3:38 PM, Bryan Cheng 
> wrote:
>
>> Hi Jimmy,
>>
>> If you sustain a long downtime, repair is almost always the way to go.
>>
>> It seems like you're asking to what extent a cluster is able to
>> recover/resync a downed peer.
>>
>> A peer will not attempt to reacquire all the data it has missed while
>> being down. Recovery happens in a few ways:
>>
>> 1) Hints: Assuming that there are enough peers to satisfy your quorum
>> requirements on write, the live peers will queue up these operations for up
>> to max_hint_window_in_ms (from cassandra.yaml). These hints will be
>> delivered once the peer recovers.
>> 2) Read repair: There is a probability that read repair will happen,
>> meaning that a query will trigger data consistency checks and updates _on
>> the query being performed_.
>> 3) Repair.
>>
>> If a machine goes down for longer than max_hint_window_in_ms, AFAIK you
>> _will_ have missing data. If you cannot tolerate this situation, you need
>> to take a look at your tunable consistency and/or trigger a repair.
>>
>> On Thu, Feb 25, 2016 at 7:26 PM, Jimmy Lin  wrote:
>>
>>> so far they are not long, just some config change and restart.
>>> if it is a 2 hr downtime due to whatever reason, is a repair a better
>>> option than trying to figure out whether the replication sync finished or not?
>>>
>>> On Thu, Feb 25, 2016 at 1:09 PM, daemeon reiydelle 
>>> wrote:
>>>
 Hmm. What are your processes when a node comes back after "a long
 offline"? Long enough to take the node offline and do a repair? Run the
 risk of serving stale data? Parallel repairs? ???

 So, what sort of time frames are "a long time"?


Daemeon C.M. Reiydelle
USA (+1) 415.501.0198
London (+44) (0) 20 8144 9872

 On Thu, Feb 25, 2016 at 11:36 AM, Jimmy Lin  wrote:

> hi all,
>
> what are the better ways to check the overall replication status of a cassandra 
> cluster?
>
>  Within a single DC, unless a node is down for a long time, most of the 
> time I feel it is pretty much a non-issue and things are replicated pretty 
> fast. But when a node comes back from a long offline period, is there a way to 
> check that the node has finished its data sync with the other nodes?
>
>  Now across DCs, we have frequent VPN outages (sometimes short, sometimes 
> long) between DCs; I would also like to know if there is a way to find out how 
> the replication between DCs is catching up under this condition?
>
>  Also, if I understand correctly, the only guaranteed way to make sure 
> data is synced is to run a complete repair job,
> is that correct? I am trying to see if there is a way to "force a quick 
> replication sync" between DCs after a VPN outage.
> Or maybe this is unnecessary, as Cassandra will catch up as fast as it 
> can, and there is nothing else we (system admins) can do to make it faster or 
> better?
>
>
>
> Sent from my iPhone
>


>>>
>>
>


Re: List of List

2016-03-01 Thread Jonathan Haddad
I'd do something like this:

CREATE TABLE questions (
question_id timeuuid primary key,
question text
);

CREATE TABLE answers (
question_id timeuuid,
answer_id timeuuid,
answer text,
primary key(question_id, answer_id)
);

CREATE TABLE comments (
answer_id timeuuid,
comment_id timeuuid,
comment text,
primary key(answer_id, comment_id)
);

You can select all the answers for a given question (ordered by the time
they appeared, yay) with :
SELECT * from answers where question_id = ?

Same applies to comments.
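For example, with the comments table keyed by answer_id as above:

SELECT * from comments where answer_id = ?;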

If you want to do categories as well, you'd want to modify the question
table to have category_id as the partition key.  Again, I suggest you watch
the videos in Datastax Academy and not try to shortcut your data modeling
knowledge as it's really, really important and screwing it up will cost you
100x the time as well as about a million headaches.
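For the category variant, a rough sketch (questions_by_category and
category_id are illustrative names, not part of the schema above):

CREATE TABLE questions_by_category (
    category_id timeuuid,
    question_id timeuuid,
    question text,
    primary key(category_id, question_id)
);

-- all questions in a category, clustered by question_id
SELECT * from questions_by_category where category_id = ?;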

On Tue, Mar 1, 2016 at 9:59 AM Sandeep Kalra 
wrote:

> I do not have a limit on the number of Answers or their comments. Assume it to
> be a clone of StackOverflow.
>
>
>
> Best Regards,
> Sandeep Kalra
>
>
> On Tue, Mar 1, 2016 at 11:29 AM, Jack Krupansky 
> wrote:
>
>> Clustering columns are your friends.
>>
>> But the first question is how you need to query the data. Queries drive
>> data models in Cassandra.
>>
>> What is the cardinality of this data - how many answers per question and
>> how many comments per answer?
>>
>>
>> -- Jack Krupansky
>>
>> On Tue, Mar 1, 2016 at 12:23 PM, Sandeep Kalra 
>> wrote:
>>
>>> Hi all.
>>>
>>> I am beginner in Cassandra.
>>>
>>> I am working on Q project where I have to maintain a list of list for
>>> objects.
>>>
>>> For e.g. A Question can have list of Answers, and each Answer can then
>>> have list of Comments.
>>>
>>> --
>>> As of now I have 3 tables. Questions, Answers, and Comments. I have
>>> stored UID of Answers in List for question, and then each
>>> answer has List in separate table. [Optionally a Comment
>>> may have replies]
>>>
>>> I do multiple queries to find the complete answers-list and then its
>>> related comments.
>>>
>>> This whole thing looks inefficient to me.
>>> --
>>>
>>> Question:
>>> *Is there a better way to do it in Cassandra*. What can I do as far as
>>> re-designing database to have lesser queries.
>>>
>>>
>>>
>>> Best Regards,
>>> Sandeep Kalra
>>>
>>>
>>
>


Re: Consistent read timeouts for bursts of reads

2016-03-01 Thread Carlos Alonso
We have had similar issues sometimes.

Usually the problem was that the failing queries were reading the same
partition as another still-running query, and that partition was too big.

The fact that it is reading the same partition is why your query works upon
retry. The fact that the partition (or the retrieved range) is too big is
why the nodes get overloaded and end up dropping the read requests.
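One quick way to check whether an oversized partition is the culprit (a
sketch; my_keyspace.my_table is a placeholder, and the exact stat names vary
a little by version):

# look for a max partition size far above the mean
nodetool cfstats my_keyspace.my_table | grep -i 'partition.*bytes'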

If you see GC pressure that would point towards my hypothesis too.

Hope this helps.

Carlos Alonso | Software Engineer | @calonso 

On 25 February 2016 at 16:34, Emīls Šolmanis 
wrote:

> Having had a read through the archives, I missed this at first, but this
> seems to be *exactly* like what we're experiencing.
>
> http://www.mail-archive.com/user@cassandra.apache.org/msg46064.html
>
> Only difference is we're getting this for reads and using CQL, but the
> behaviour is identical.
>
> On Thu, 25 Feb 2016 at 14:55 Emīls Šolmanis 
> wrote:
>
>> Hello,
>>
>> We're having a problem with concurrent requests. It seems that whenever
>> we try resolving more
>> than ~ 15 queries at the same time, one or two get a read timeout and
>> then succeed on a retry.
>>
>> We're running Cassandra 2.2.4 accessed via the 2.1.9 Datastax driver on
>> AWS.
>>
>> What we've found while investigating:
>>
>>  * this is not db-wide. Trying the same pattern against another table
>> everything works fine.
>>  * it fails 1 or 2 requests regardless of how many are executed in
>> parallel, i.e., it's still 1 or 2 when we ramp it up to ~ 120 concurrent
>> requests and doesn't seem to scale up.
>>  * the problem is consistently reproducible. It happens both under
>> heavier load and when just firing off a single batch of requests for
>> testing.
>>  * tracing the faulty requests says everything is great. An example
>> trace: https://gist.github.com/emilssolmanis/41e1e2ecdfd9a0569b1a
>>  * the only peculiar thing in the logs is there's no acknowledgement of
>> the request being accepted by the server, as seen in
>> https://gist.github.com/emilssolmanis/242d9d02a6d8fb91da8a
>>  * there's nothing funny in the timed out Cassandra node's logs around
>> that time as far as I can tell, not even in the debug logs.
>>
>> Any ideas about what might be causing this, pointers to server config
>> options, or how else we might debug this would be much appreciated.
>>
>> Kind regards,
>> Emils
>>
>>


Re: Cassandra Usages

2016-03-01 Thread Jack Krupansky
I would spin it as Cassandra being the right choice where your primary need
is OLTP and you have a secondary need for analytics. IOW, where you would
otherwise need to use two separate databases for the same data.


-- Jack Krupansky

On Tue, Mar 1, 2016 at 12:40 PM, Jonathan Haddad  wrote:

> Spark & Cassandra work just fine together, but, as I said, Cassandra is
> *primarily* used for OLTP.  If your main use case is analytics, I would use
> something that's built for analytics.  If 90%+ of your queries are going to
> be 1-10ms & customer facing, then you're good to go.  If you're building
> something to replace OLAP cubes, I'd look at something else.
>
> On Tue, Mar 1, 2016 at 8:52 AM Jack Krupansky 
> wrote:
>
>> OLAP using Cassandra and Spark:
>>
>> http://www.slideshare.net/EvanChan2/breakthrough-olap-performance-with-cassandra-and-spark
>>
>> What is the cardinality of your cube dimensions? Obviously any
>> multi-dimensional data must be flattened.
>>
>> Cassandra tables have fixed named columns, but... the map datatype with
>> string key values effectively gives you extensible columns.
>>
>>
>>
>> -- Jack Krupansky
>>
>> On Tue, Mar 1, 2016 at 11:22 AM, Andrés Ivaldi 
>> wrote:
>>
>>> Jonathan, thanks for the link.
>>> I believe Cassandra may be good as the data store part, because it is fast
>>> for I/O and handles time series; the analytics could be done with Apache
>>> Ignite and/or Apache Spark.
>>> What worries me is that it looks very complex to create the structure for
>>> each fact table and then extend it.
>>>
>>> regards.
>>>
>>> On Sun, Feb 28, 2016 at 12:28 PM, Jonathan Haddad 
>>> wrote:
>>>
 Cassandra is primarily used as an OLTP database, not analytics. You
 should watch this 30 min video discussing Cassandra core concepts (coming
 from a relational background):
 https://academy.datastax.com/courses/ds101-introduction-cassandra

 On Sun, Feb 28, 2016 at 5:40 AM Andrés Ivaldi 
 wrote:

> Hello. At my work we are looking for new technologies for an analysis
> engine, and we are evaluating different technologies; one of them is
> Cassandra as our data repository.
>
> Currently we run analysis queries against an OLAP cube and an RDBMS, using
> MSSQL as our data repository. The cube is obsolete and the SQL Server engine
> is slow as a data repository.
>
> I don't know much about Cassandra; I have read some books and it looks to
> fit well with what we need, but there are some things that look like a
> problem for us.
>
> Our engine is designed to be scalable, flexible and dynamic: any user
> can add new dimensions or measures from any source. All the data is stored
> in the cube (fixed data) and MSSQL (dynamic data), so we have decoupled
> tables with the dimension values.
>
>
> OK, with the context given, I'd like to clear up some doubts:
>
> - Am I able to flatten the table with all the possible dimension values into
> Cassandra, creating the PK from the dimension columns? Will this give me
> the "sensation" of a data pivot over the PK columns? If so, what if I
> want to change the order of the columns, or add or remove some?
> - Is it possible to extend the values of a row dynamically? What we often
> do is join a row against a mapped external data value to extend
> the dimension's hierarchical value structure (i.e. state->Country->Continent).
>
> I know we can do some of these things in the core of our engine, like
> extending the dimension values or reducing columns, but as we are
> evaluating different technologies it is good to know.
>
> Regards!!
>
>
> --
> Ing. Ivaldi Andres
>

>>>
>>>
>>> --
>>> Ing. Ivaldi Andres
>>>
>>
>>


Re: List of List

2016-03-01 Thread Sandeep Kalra
I do not have a limit on the number of Answers or their comments. Assume it to
be a clone of StackOverflow.



Best Regards,
Sandeep Kalra


On Tue, Mar 1, 2016 at 11:29 AM, Jack Krupansky 
wrote:

> Clustering columns are your friends.
>
> But the first question is how you need to query the data. Queries drive
> data models in Cassandra.
>
> What is the cardinality of this data - how many answers per question and
> how many comments per answer?
>
>
> -- Jack Krupansky
>
> On Tue, Mar 1, 2016 at 12:23 PM, Sandeep Kalra 
> wrote:
>
>> Hi all.
>>
>> I am beginner in Cassandra.
>>
>> I am working on Q project where I have to maintain a list of list for
>> objects.
>>
>> For e.g. A Question can have list of Answers, and each Answer can then
>> have list of Comments.
>>
>> --
>> As of now I have 3 tables. Questions, Answers, and Comments. I have
>> stored UID of Answers in List for question, and then each
>> answer has List in separate table. [Optionally a Comment
>> may have replies]
>>
>> I do multiple queries to find the complete answers-list and then its
>> related comments.
>>
>> This whole thing looks inefficient to me.
>> --
>>
>> Question:
>> *Is there a better way to do it in Cassandra*. What can I do as far as
>> re-designing database to have lesser queries.
>>
>>
>>
>> Best Regards,
>> Sandeep Kalra
>>
>>
>


IOException: MkDirs Failed to Create in Spark

2016-03-01 Thread Anuj Wadehra
Hi
 
We are using Spark with Cassandra. While using rdd.saveAsTextFile("/tmp/dr"), 
we are getting the following error when we run the application with root access. 
Spark is able to create two levels of directories but fails after that with an 
exception:

16/03/01 22:59:48 WARN TaskSetManager: Lost task 73.3 in stage 0.0 (TID 144, 
host1): java.io.IOException: Mkdirs failed to create 
file:/tmp/dr/_temporary/0/_temporary/attempt_201603012259__m_73_144
at 
org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:438)
at 
org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:424)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:906)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:799)
at 
org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1056)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1047)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)


Permissions on /tmp:
chmod -R 777 /tmp has been executed and permissions look like:
drwxrwxrwx.  31 root root 1.2K Mar  1 22:54 tmp

Forgive me for raising this question on the Cassandra mailing list. I think the 
Spark and Cassandra user bases overlap, so I expected to find help here.
I am not yet part of the Spark mailing list.

Thanks
Anuj


Re: Cassandra Usages

2016-03-01 Thread Jonathan Haddad
Spark & Cassandra work just fine together, but, as I said, Cassandra is
*primarily* used for OLTP.  If your main use case is analytics, I would use
something that's built for analytics.  If 90%+ of your queries are going to
be 1-10ms & customer facing, then you're good to go.  If you're building
something to replace OLAP cubes, I'd look at something else.

On Tue, Mar 1, 2016 at 8:52 AM Jack Krupansky 
wrote:

> OLAP using Cassandra and Spark:
>
> http://www.slideshare.net/EvanChan2/breakthrough-olap-performance-with-cassandra-and-spark
>
> What is the cardinality of your cube dimensions? Obviously any
> multi-dimensional data must be flattened.
>
> Cassandra tables have fixed named columns, but... the map datatype with
> string key values effectively gives you extensible columns.
>
>
>
> -- Jack Krupansky
>
> On Tue, Mar 1, 2016 at 11:22 AM, Andrés Ivaldi  wrote:
>
>> Jonathan, thanks for the link.
>> I believe Cassandra may be good as the data store part, because it is fast
>> for I/O and handles time series; the analytics could be done with Apache
>> Ignite and/or Apache Spark.
>> What worries me is that it looks very complex to create the structure for
>> each fact table and then extend it.
>>
>> regards.
>>
>> On Sun, Feb 28, 2016 at 12:28 PM, Jonathan Haddad 
>> wrote:
>>
>>> Cassandra is primarily used as an OLTP database, not analytics. You
>>> should watch this 30 min video discussing Cassandra core concepts (coming
>>> from a relational background):
>>> https://academy.datastax.com/courses/ds101-introduction-cassandra
>>>
>>> On Sun, Feb 28, 2016 at 5:40 AM Andrés Ivaldi 
>>> wrote:
>>>
 Hello, At my work we are looking for new technologies for an Analysis
 Engine, and we are evaluating differents technologies one of them is
 Cassandra as our Data repository.

 Now we can execute query analysis agains an OLAP Cube and RDBMS, using
 MSSQL as our data repository. Cube is obsolete and SQL server engine is
 slow as data repository.

 I don't know much about cassandra, I read some books, and looks to fit
 well on what we are needing, but there are some things that looks like a
 problem for us.

 Our engine is designed to be scalable, flexible and dynamic, any user
 can add new dimensions or measures from any source, all the data is stored
 on Cube(this is fixed data) and MSSQL(dynamic data) so we have decoupled
 tables with the dimension values.


 Ok, with the context given I'll like to clear some doubts

 - I able to flat the table with all the possible dimension values to
 cassandra, creating the pk against the dimension columns? this will give me
 the "sensation" of data pivot over the PK columns? If correct, what if I
 want to select the order of the columns, or add another or reduce them?
 - It's possible to extend the values of a row dynamically? What we do
 often is join row against a value of a mapped external data value to extend
 the dimensions hierarchical value structure (ie state->Country->Continent)

 I know we can do some of this things in the core of our engine, like
 the dimension extension of the values or reduce columns, but as we are
 evaluating differents technologies is good to know.

 Regards!!


 --
 Ing. Ivaldi Andres

>>>
>>
>>
>> --
>> Ing. Ivaldi Andres
>>
>
>


Re: List of List

2016-03-01 Thread Jonathan Haddad
You probably want to watch some intro videos on Datastax Academy.
https://academy.datastax.com/

I suggest the intro video to get some basics down:
https://academy.datastax.com/courses/ds101-introduction-cassandra
and then core concepts, a pretty thorough intro:
https://academy.datastax.com/courses/ds201-cassandra-core-concepts

On Tue, Mar 1, 2016 at 9:31 AM Jack Krupansky 
wrote:

> Clustering columns are your friends.
>
> But the first question is how you need to query the data. Queries drive
> data models in Cassandra.
>
> What is the cardinality of this data - how many answers per question and
> how many comments per answer?
>
>
> -- Jack Krupansky
>
> On Tue, Mar 1, 2016 at 12:23 PM, Sandeep Kalra 
> wrote:
>
>> Hi all.
>>
>> I am beginner in Cassandra.
>>
>> I am working on Q project where I have to maintain a list of list for
>> objects.
>>
>> For e.g. A Question can have list of Answers, and each Answer can then
>> have list of Comments.
>>
>> --
>> As of now I have 3 tables. Questions, Answers, and Comments. I have
>> stored UID of Answers in List for question, and then each
>> answer has List in separate table. [Optionally a Comment
>> may have replies]
>>
>> I do multiple queries to find the complete answers-list and then its
>> related comments.
>>
>> This whole thing looks inefficient to me.
>> --
>>
>> Question:
>> *Is there a better way to do it in Cassandra*. What can I do as far as
>> re-designing database to have lesser queries.
>>
>>
>>
>> Best Regards,
>> Sandeep Kalra
>>
>>
>


Re: List of List

2016-03-01 Thread Jack Krupansky
Clustering columns are your friends.

But the first question is how you need to query the data. Queries drive
data models in Cassandra.

What is the cardinality of this data - how many answers per question and
how many comments per answer?


-- Jack Krupansky

On Tue, Mar 1, 2016 at 12:23 PM, Sandeep Kalra 
wrote:

> Hi all.
>
> I am beginner in Cassandra.
>
> I am working on Q project where I have to maintain a list of list for
> objects.
>
> For e.g. A Question can have list of Answers, and each Answer can then
> have list of Comments.
>
> --
> As of now I have 3 tables. Questions, Answers, and Comments. I have stored
> UID of Answers in List for question, and then each answer
> has List in separate table. [Optionally a Comment may have
> replies]
>
> I do multiple queries to find the complete answers-list and then its
> related comments.
>
> This whole thing looks inefficient to me.
> --
>
> Question:
> *Is there a better way to do it in Cassandra*. What can I do as far as
> re-designing database to have lesser queries.
>
>
>
> Best Regards,
> Sandeep Kalra
>
>


List of List

2016-03-01 Thread Sandeep Kalra
Hi all.

I am a beginner in Cassandra.

I am working on a Q project where I have to maintain a list of lists for
objects.

For example, a Question can have a list of Answers, and each Answer can then
have a list of Comments.

--
As of now I have 3 tables: Questions, Answers, and Comments. I have stored the
UIDs of Answers in a List for each question, and then each answer
has a List of comments in a separate table. [Optionally a Comment may have
replies]

I do multiple queries to find the complete answers list and then its
related comments.

This whole thing looks inefficient to me.
-- 

Question:
*Is there a better way to do this in Cassandra?* What can I do as far as
re-designing the database to have fewer queries?



Best Regards,
Sandeep Kalra


Re: Cassandra Usages

2016-03-01 Thread Jack Krupansky
OLAP using Cassandra and Spark:
http://www.slideshare.net/EvanChan2/breakthrough-olap-performance-with-cassandra-and-spark

What is the cardinality of your cube dimensions? Obviously any
multi-dimensional data must be flattened.

Cassandra tables have fixed named columns, but... the map datatype with
string key values effectively gives you extensible columns.
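A minimal sketch of that idea (all names here are invented for illustration):

CREATE TABLE fact_rows (
    row_id uuid PRIMARY KEY,
    country text,                -- fixed, known-up-front dimension
    extra_dims map<text, text>   -- dynamically added dimensions live here
);

-- "adding a column" is just writing a new map key
UPDATE fact_rows SET extra_dims['continent'] = 'Europe'
  WHERE row_id = 123e4567-e89b-12d3-a456-426655440000;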



-- Jack Krupansky

On Tue, Mar 1, 2016 at 11:22 AM, Andrés Ivaldi  wrote:

> Jonathan, thanks for the link.
> I believe Cassandra may be good as the data store part, because it is fast
> for I/O and handles time series; the analytics could be done with Apache
> Ignite and/or Apache Spark.
> What worries me is that it looks very complex to create the structure for
> each fact table and then extend it.
>
> regards.
>
> On Sun, Feb 28, 2016 at 12:28 PM, Jonathan Haddad 
> wrote:
>
>> Cassandra is primarily used as an OLTP database, not analytics. You
>> should watch this 30 min video discussing Cassandra core concepts (coming
>> from a relational background):
>> https://academy.datastax.com/courses/ds101-introduction-cassandra
>>
>> On Sun, Feb 28, 2016 at 5:40 AM Andrés Ivaldi  wrote:
>>
>>> Hello, At my work we are looking for new technologies for an Analysis
>>> Engine, and we are evaluating differents technologies one of them is
>>> Cassandra as our Data repository.
>>>
>>> Now we can execute query analysis agains an OLAP Cube and RDBMS, using
>>> MSSQL as our data repository. Cube is obsolete and SQL server engine is
>>> slow as data repository.
>>>
>>> I don't know much about cassandra, I read some books, and looks to fit
>>> well on what we are needing, but there are some things that looks like a
>>> problem for us.
>>>
>>> Our engine is designed to be scalable, flexible and dynamic, any user
>>> can add new dimensions or measures from any source, all the data is stored
>>> on Cube(this is fixed data) and MSSQL(dynamic data) so we have decoupled
>>> tables with the dimension values.
>>>
>>>
>>> Ok, with the context given I'll like to clear some doubts
>>>
>>> - I able to flat the table with all the possible dimension values to
>>> cassandra, creating the pk against the dimension columns? this will give me
>>> the "sensation" of data pivot over the PK columns? If correct, what if I
>>> want to select the order of the columns, or add another or reduce them?
>>> - It's possible to extend the values of a row dynamically? What we do
>>> often is join row against a value of a mapped external data value to extend
>>> the dimensions hierarchical value structure (ie state->Country->Continent)
>>>
>>> I know we can do some of this things in the core of our engine, like the
>>> dimension extension of the values or reduce columns, but as we are
>>> evaluating differents technologies is good to know.
>>>
>>> Regards!!
>>>
>>>
>>> --
>>> Ing. Ivaldi Andres
>>>
>>
>
>
> --
> Ing. Ivaldi Andres
>


Re: Cassandra Usages

2016-03-01 Thread Andrés Ivaldi
Jonathan, thanks for the link.
I believe Cassandra may be good as the data store part, because it is fast
for I/O and handles time series; the analytics could be done with Apache
Ignite and/or Apache Spark.
What worries me is that it looks very complex to create the structure for
each fact table and then extend it.

regards.

On Sun, Feb 28, 2016 at 12:28 PM, Jonathan Haddad  wrote:

> Cassandra is primarily used as an OLTP database, not analytics. You should
> watch this 30 min video discussing Cassandra core concepts (coming from a
> relational background):
> https://academy.datastax.com/courses/ds101-introduction-cassandra
>
> On Sun, Feb 28, 2016 at 5:40 AM Andrés Ivaldi  wrote:
>
>> Hello, At my work we are looking for new technologies for an Analysis
>> Engine, and we are evaluating differents technologies one of them is
>> Cassandra as our Data repository.
>>
>> Now we can execute query analysis agains an OLAP Cube and RDBMS, using
>> MSSQL as our data repository. Cube is obsolete and SQL server engine is
>> slow as data repository.
>>
>> I don't know much about cassandra, I read some books, and looks to fit
>> well on what we are needing, but there are some things that looks like a
>> problem for us.
>>
>> Our engine is designed to be scalable, flexible and dynamic, any user can
>> add new dimensions or measures from any source, all the data is stored on
>> Cube(this is fixed data) and MSSQL(dynamic data) so we have decoupled
>> tables with the dimension values.
>>
>>
>> Ok, with the context given I'll like to clear some doubts
>>
>> - I able to flat the table with all the possible dimension values to
>> cassandra, creating the pk against the dimension columns? this will give me
>> the "sensation" of data pivot over the PK columns? If correct, what if I
>> want to select the order of the columns, or add another or reduce them?
>> - It's possible to extend the values of a row dynamically? What we do
>> often is join row against a value of a mapped external data value to extend
>> the dimensions hierarchical value structure (ie state->Country->Continent)
>>
>> I know we can do some of this things in the core of our engine, like the
>> dimension extension of the values or reduce columns, but as we are
>> evaluating differents technologies is good to know.
>>
>> Regards!!
>>
>>
>> --
>> Ing. Ivaldi Andres
>>
>


-- 
Ing. Ivaldi Andres


Re: Practical limit on number of column families

2016-03-01 Thread Jack Krupansky
I don't think Cassandra was "purposefully developed" for some target number
of tables - there is no evidence of any such explicit intent. Instead,
it would be fair to say that Cassandra was "not purposefully developed"
with a goal of supporting "large numbers of tables." Sometimes features and
capabilities come for free or as a side effect of the technologies used,
but usually specific features and specific capabilities (such as large
numbers of tables) require explicit intent and explicit effort.

One could indeed endeavor to design a data store (I'm not even sure it
would still be considered a database per se) that supported either large
numbers of tables or an additional level of storage model in between table
and row (call it "group" maybe or "sub-table".) But obviously Cassandra was
not designed with that goal in mind.

Traditionally, a "table" is a defined relation over a set of data. Relation
and data are distinct concepts. And a relation name is not simply a
Java-style "object". A relation (table) name is supposed to represent an
abstraction or entity type, while essentially all of the cases I have heard
of for wanting thousands (or even hundreds) of tables are trying to use a
table as more of a container for a group of rows for a specific entity
instance rather than a distinct entity type. Granted, Cassandra is not
obligated to be limited to the relational model, but Cassandra, especially
CQL, is intentionally modeled reasonably closely with the relational model
in terms of the data modeling abstractions even though the storage engine
is designed to scale across nodes.

You could file a Jira requesting such a feature improvement. And then we
would see if sentiment has shifted over the years.

The key thing is to offer up a use case that warrants support for large
numbers of tables. So far, it has usually been the case that the perceived
need for separate tables could easily be met using clustering columns of a
single table.
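As a rough illustration of that pattern (a sketch; the table and column names
are made up), the would-be table name simply becomes the partition key:

CREATE TABLE rows_by_group (
    group_name text,   -- what would otherwise have been a separate table
    row_id timeuuid,
    payload text,
    PRIMARY KEY (group_name, row_id)
);

-- reads stay per-group, exactly as if each group had its own table
SELECT * FROM rows_by_group WHERE group_name = 'customer_42';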

Seriously, if you guys can define a legitimate use case that can't easily
be handled by a single table, that could get the discussion started.

-- Jack Krupansky

On Tue, Mar 1, 2016 at 9:11 AM, Fernando Jimenez <
fernando.jime...@wealth-port.com> wrote:

> Hi Jack
>
> Being purposefully developed to only handle up to “a few hundred” tables
> is reason enough. I accept that, and likely a use case with many tables was
> never really considered. But I would still like to understand the design
> choices made so perhaps we gain some confidence level in this upper limit
> in the number of tables. The best estimate we have so far is “a few
> hundred” which is a bit vague.
>
> Regarding scaling, I’m not talking about scaling in terms of data volume,
> but on how the data is structured. One thousand tables with one row each is
> the same data volume as one table with one thousand rows, excluding any
> data structures required to maintain the extra tables. But whereas the
> first seems likely to bring a Cassandra cluster to its knees, the second
> will run happily on a single node cluster in a low end machine.
>
> We will design our code to use a single table to avoid having nightmares
> with this issue. But if there is any authoritative documentation on this
> characteristic of Cassandra, I would love to know more.
>
> FJ
>
>
> On 01 Mar 2016, at 14:23, Jack Krupansky  wrote:
>
> I don't think there are any "reasons behind it." It is simply empirical
> experience - as reported here.
>
> Cassandra scales in two dimension - number of rows per node and number of
> nodes. If some source of information lead you to believe otherwise, please
> point out the source so that we can endeavor to correct it.
>
> The exact number of rows per node and tables per node will always have to
> be evaluated empirically - a proof of concept implementation, since it all
> depends on the mix of capabilities of your hardware combined with your
> specific data model, your specific data values, your specific access
> patterns, and your specific load. And it also depends on your own personal
> tolerance for degradation of latency and throughput - some people might
> find a given set of performance  metrics acceptable while other might not.
>
> -- Jack Krupansky
>
> On Tue, Mar 1, 2016 at 3:54 AM, Fernando Jimenez <
> fernando.jime...@wealth-port.com> wrote:
>
>> Hi Tommaso
>>
>> It’s not that I _need_ a large number of tables. This approach maps
>> easily to the problem we are trying to solve, but it’s becoming clear it’s
>> not the right approach.
>>
>> At the moment I’m trying to understand the limitations in Cassandra
>> regarding number of Tables and the reasons behind it. I’ve come to the
>> email list as my Google-foo is not giving me what I’m looking for :(
>>
>> FJ
>>
>>
>>
>> On 01 Mar 2016, at 09:36, tommaso barbugli  wrote:
>>
>> Hi Fernando,
>>
>> I used to have a cluster with ~300 tables (1 keyspace) on C* 2.0, it was
>> a real pain in terms of 

Re: Disable writing to debug.log

2016-03-01 Thread Michael Mior
There are instructions given in /etc/cassandra/logback.xml.

Looking later in the file, you'll see the following:

  <appender name="ASYNCDEBUGLOG" class="ch.qos.logback.classic.AsyncAppender">
    <queueSize>1024</queueSize>
    <discardingThreshold>0</discardingThreshold>
    <includeCallerData>true</includeCallerData>
    <appender-ref ref="DEBUGLOG" />
  </appender>

Commenting out this section will disable writing to debug.log.

--
Michael Mior
mm...@uwaterloo.ca

2016-03-01 10:43 GMT-05:00 Rakesh Kumar :

> Version: Cassandra 3.3
>
> Can anyone tell on how to disable writing to debug.log.
>
> thanks.
>


Disable writing to debug.log

2016-03-01 Thread Rakesh Kumar
Version: Cassandra 3.3


Can anyone tell me how to disable writing to debug.log?


thanks.


Re: Practical limit on number of column families

2016-03-01 Thread Vlad
>If your Jira search fu is strong enough
And it is! )

>you should be able to find it yourself
And I did! )

I see that this issue originates from a problem with the Java GC's design, but 
judging by the date that was the Java 6 era. Now we have Java 8 with a new GC 
mechanism. Does this problem still exist with Java 8? Any chance of using the 
original method to reduce overhead and "be happy with the results"?

Regards, Vlad
 

On Tuesday, March 1, 2016 4:07 PM, Jack Krupansky 
 wrote:
 

 I'll defer to one of the senior committers as to whether they want that 
information disseminated any further than it already is. It was intentionally 
not documented since it is not recommended. If your Jira search fu is strong 
enough you should be able to find it yourself, but again, its use is strongly 
not recommended.
As the Jira notes, "having more than dozens or hundreds of tables defined is 
almost certainly a Bad Idea."
"Bad Idea" means not good. As in don't go there. And if you do, don't expect 
such a mis-adventure to be supported by the community.
-- Jack Krupansky
On Tue, Mar 1, 2016 at 8:39 AM, Vlad  wrote:

Hi Jack,

>you can reduce the overhead per table ... an undocumented Jira
Can you please point to this Jira number?

>it is strongly not recommended
What are the consequences of this (besides performance degradation, if any)?
Thanks.


On Tuesday, March 1, 2016 7:23 AM, Jack Krupansky 
 wrote:
 

 3,000 entries? What's an "entry"? Do you mean row, column, or... what?

You are using the obsolete terminology of CQL2 and Thrift - column family. With 
CQL3 you should be creating "tables". The practical recommendation of an upper 
limit of a few hundred tables across all key spaces remains.
Technically you can go higher and technically you can reduce the overhead per 
table (an undocumented Jira - intentionally undocumented since it is strongly 
not recommended), but... it is unlikely that you will be happy with the results.
What is the nature of the use case?
You basically have two choices: an additional cluster column to distinguish 
categories of table, or separate clusters for each few hundred of tables.

-- Jack Krupansky
On Mon, Feb 29, 2016 at 12:30 PM, Fernando Jimenez 
 wrote:

Hi all
I have a use case for Cassandra that would require creating a large number of 
column families. I have found references to early versions of Cassandra where 
each column family would require a fixed amount of memory on all nodes, 
effectively imposing an upper limit on the total number of CFs. I have also 
seen rumblings that this may have been fixed in later versions.
To put the question to rest, I have setup a DSE sandbox and created some code 
to generate column families populated with 3,000 entries each.
Unfortunately I have now hit this issue: 
https://issues.apache.org/jira/browse/CASSANDRA-9291
So I will have to retest against Cassandra 3.0 instead
However, I would like to understand the limitations regarding creation of 
column families. 

 * Is there a practical upper limit?
 * Is this a fixed limit, or does it scale as more nodes are added into the cluster?
 * Is there a difference between one keyspace with thousands of column families, 
vs thousands of keyspaces with only a few column families each?

I haven’t found any hard evidence/documentation to help me here, but if you can 
point me in the right direction, I will oblige and RTFM away.

Many thanks for your help!

Cheers
FJ




Re: Practical limit on number of column families

2016-03-01 Thread Fernando Jimenez
Hi Jack

Being purposefully developed to only handle up to “a few hundred” tables is 
reason enough. I accept that, and likely a use case with many tables was never 
really considered. But I would still like to understand the design choices made 
so perhaps we gain some confidence level in this upper limit in the number of 
tables. The best estimate we have so far is “a few hundred” which is a bit 
vague. 

Regarding scaling, I’m not talking about scaling in terms of data volume, but 
on how the data is structured. One thousand tables with one row each is the 
same data volume as one table with one thousand rows, excluding any data 
structures required to maintain the extra tables. But whereas the first seems 
likely to bring a Cassandra cluster to its knees, the second will run happily 
on a single node cluster in a low end machine.

We will design our code to use a single table to avoid having nightmares with 
this issue. But if there is any authoritative documentation on this 
characteristic of Cassandra, I would love to know more.

FJ


> On 01 Mar 2016, at 14:23, Jack Krupansky  wrote:
> 
> I don't think there are any "reasons behind it." It is simply empirical 
> experience - as reported here.
> 
> Cassandra scales in two dimension - number of rows per node and number of 
> nodes. If some source of information lead you to believe otherwise, please 
> point out the source so that we can endeavor to correct it.
> 
> The exact number of rows per node and tables per node will always have to be 
> evaluated empirically - a proof of concept implementation, since it all 
> depends on the mix of capabilities of your hardware combined with your 
> specific data model, your specific data values, your specific access 
> patterns, and your specific load. And it also depends on your own personal 
> tolerance for degradation of latency and throughput - some people might find 
> a given set of performance  metrics acceptable while other might not.
> 
> -- Jack Krupansky
> 
> On Tue, Mar 1, 2016 at 3:54 AM, Fernando Jimenez 
> > 
> wrote:
> Hi Tommaso
> 
> It’s not that I _need_ a large number of tables. This approach maps easily to 
> the problem we are trying to solve, but it’s becoming clear it’s not the 
> right approach.
> 
> At the moment I’m trying to understand the limitations in Cassandra regarding 
> number of Tables and the reasons behind it. I’ve come to the email list as my 
> Google-foo is not giving me what I’m looking for :(
> 
> FJ
> 
> 
> 
>> On 01 Mar 2016, at 09:36, tommaso barbugli > > wrote:
>> 
>> Hi Fernando,
>> 
>> I used to have a cluster with ~300 tables (1 keyspace) on C* 2.0, it was a 
>> real pain in terms of operations. Repairs were terribly slow, boot of C* 
>> slowed down and in general tracking table metrics becomes bit more work. Why 
>> do you need this high number of tables?
>> 
>> Tommaso
>> 
>> On Tue, Mar 1, 2016 at 9:16 AM, Fernando Jimenez 
>> > 
>> wrote:
>> Hi Jack
>> 
>> By entry I mean row
>> 
>> Apologies for the “obsolete terminology”. When I first looked at Cassandra 
>> it was still on CQL2, and now that I’m looking at it again I’ve defaulted to 
>> the terms I already knew. I will bear it in mind and call them tables from 
>> now on.
>> 
>> Is there any documentation about this limit? for example, I’d be keen to 
>> know how much memory is consumed per table, and I’m also curious about the 
>> reasons for keeping this in memory. I’m trying to understand the limitations 
>> here, rather than challenge them.
>> 
>> So far I found nothing in my search, hence why I had to resort to some “load 
>> testing” to see what happens when you push the table count high
>> 
>> Thanks
>> FJ
>> 
>> 
>>> On 01 Mar 2016, at 06:23, Jack Krupansky >> > wrote:
>>> 
>>> 3,000 entries? What's an "entry"? Do you mean row, column, or... what?
>>> 
>>> You are using the obsolete terminology of CQL2 and Thrift - column family. 
>>> With CQL3 you should be creating "tables". The practical recommendation of 
>>> an upper limit of a few hundred tables across all key spaces remains.
>>> 
>>> Technically you can go higher and technically you can reduce the overhead 
>>> per table (an undocumented Jira - intentionally undocumented since it is 
>>> strongly not recommended), but... it is unlikely that you will be happy 
>>> with the results.
>>> 
>>> What is the nature of the use case?
>>> 
>>> You basically have two choices: an additional cluster column to distinguish 
>>> categories of table, or separate clusters for each few hundred of tables.
>>> 
>>> 
>>> -- Jack Krupansky
>>> 
>>> On Mon, Feb 29, 2016 at 12:30 PM, Fernando Jimenez 
>>> >> 

Re: Practical limit on number of column families

2016-03-01 Thread Jack Krupansky
I'll defer to one of the senior committers as to whether they want that
information disseminated any further than it already is. It was
intentionally not documented since it is not recommended. If your Jira
search fu is strong enough you should be able to find it yourself, but
again, its use is strongly not recommended.

As the Jira notes, "having more than dozens or hundreds of tables defined
is almost certainly a Bad Idea."

"Bad Idea" means not good. As in don't go there. And if you do, don't
expect such a mis-adventure to be supported by the community.

-- Jack Krupansky

On Tue, Mar 1, 2016 at 8:39 AM, Vlad  wrote:

> Hi Jack,
>
> >you can reduce the overhead per table  an undocumented Jira
> Can you please point to this Jira number?
>
> >it is strongly not recommended
> What is consequences of this (besides performance degradation, if any)?
>
> Thanks.
>
>
> On Tuesday, March 1, 2016 7:23 AM, Jack Krupansky <
> jack.krupan...@gmail.com> wrote:
>
>
> 3,000 entries? What's an "entry"? Do you mean row, column, or... what?
>
> You are using the obsolete terminology of CQL2 and Thrift - column family.
> With CQL3 you should be creating "tables". The practical recommendation of
> an upper limit of a few hundred tables across all key spaces remains.
>
> Technically you can go higher and technically you can reduce the overhead
> per table (an undocumented Jira - intentionally undocumented since it is
> strongly not recommended), but... it is unlikely that you will be happy
> with the results.
>
> What is the nature of the use case?
>
> You basically have two choices: an additional cluster column to
> distinguish categories of table, or separate clusters for each few hundred
> of tables.
>
>
> -- Jack Krupansky
>
> On Mon, Feb 29, 2016 at 12:30 PM, Fernando Jimenez <
> fernando.jime...@wealth-port.com> wrote:
>
> Hi all
>
> I have a use case for Cassandra that would require creating a large number
> of column families. I have found references to early versions of Cassandra
> where each column family would require a fixed amount of memory on all
> nodes, effectively imposing an upper limit on the total number of CFs. I
> have also seen rumblings that this may have been fixed in later versions.
>
> To put the question to rest, I have setup a DSE sandbox and created some
> code to generate column families populated with 3,000 entries each.
>
> Unfortunately I have now hit this issue:
> https://issues.apache.org/jira/browse/CASSANDRA-9291
>
> So I will have to retest against Cassandra 3.0 instead
>
> However, I would like to understand the limitations regarding creation of
> column families.
>
> * Is there a practical upper limit?
> * is this a fixed limit, or does it scale as more nodes are added into the
> cluster?
> * Is there a difference between one keyspace with thousands of column
> families, vs thousands of keyspaces with only a few column families each?
>
> I haven’t found any hard evidence/documentation to help me here, but if
> you can point me in the right direction, I will oblige and RTFM away.
>
> Many thanks for your help!
>
> Cheers
> FJ
>
>
>
>
>
>


Re: Practical limit on number of column families

2016-03-01 Thread Vlad
Hi Jack,

>you can reduce the overhead per table ... an undocumented Jira
Can you please point to this Jira number?

>it is strongly not recommended
What are the consequences of this (besides performance degradation, if any)?
Thanks.


On Tuesday, March 1, 2016 7:23 AM, Jack Krupansky 
 wrote:
 

 3,000 entries? What's an "entry"? Do you mean row, column, or... what?

You are using the obsolete terminology of CQL2 and Thrift - column family. With 
CQL3 you should be creating "tables". The practical recommendation of an upper 
limit of a few hundred tables across all key spaces remains.
Technically you can go higher and technically you can reduce the overhead per 
table (an undocumented Jira - intentionally undocumented since it is strongly 
not recommended), but... it is unlikely that you will be happy with the results.
What is the nature of the use case?
You basically have two choices: an additional cluster column to distinguish 
categories of table, or separate clusters for each few hundred of tables.

-- Jack Krupansky
On Mon, Feb 29, 2016 at 12:30 PM, Fernando Jimenez 
 wrote:

Hi all
I have a use case for Cassandra that would require creating a large number of 
column families. I have found references to early versions of Cassandra where 
each column family would require a fixed amount of memory on all nodes, 
effectively imposing an upper limit on the total number of CFs. I have also 
seen rumblings that this may have been fixed in later versions.
To put the question to rest, I have setup a DSE sandbox and created some code 
to generate column families populated with 3,000 entries each.
Unfortunately I have now hit this issue: 
https://issues.apache.org/jira/browse/CASSANDRA-9291
So I will have to retest against Cassandra 3.0 instead
However, I would like to understand the limitations regarding creation of 
column families. 

 * Is there a practical upper limit?
 * Is this a fixed limit, or does it scale as more nodes are added into the cluster?
 * Is there a difference between one keyspace with thousands of column families, 
vs thousands of keyspaces with only a few column families each?

I haven’t found any hard evidence/documentation to help me here, but if you can 
point me in the right direction, I will oblige and RTFM away.

Many thanks for your help!

Cheers
FJ


Re: Practical limit on number of column families

2016-03-01 Thread Jack Krupansky
I don't think there are any "reasons behind it." It is simply empirical
experience - as reported here.

Cassandra scales in two dimension - number of rows per node and number of
nodes. If some source of information lead you to believe otherwise, please
point out the source so that we can endeavor to correct it.

The exact number of rows per node and tables per node will always have to
be evaluated empirically - a proof of concept implementation, since it all
depends on the mix of capabilities of your hardware combined with your
specific data model, your specific data values, your specific access
patterns, and your specific load. And it also depends on your own personal
tolerance for degradation of latency and throughput - some people might
find a given set of performance metrics acceptable while others might not.

-- Jack Krupansky

On Tue, Mar 1, 2016 at 3:54 AM, Fernando Jimenez <
fernando.jime...@wealth-port.com> wrote:

> Hi Tommaso
>
> It’s not that I _need_ a large number of tables. This approach maps easily
> to the problem we are trying to solve, but it’s becoming clear it’s not the
> right approach.
>
> At the moment I’m trying to understand the limitations in Cassandra
> regarding number of Tables and the reasons behind it. I’ve come to the
> email list as my Google-foo is not giving me what I’m looking for :(
>
> FJ
>
>
>
> On 01 Mar 2016, at 09:36, tommaso barbugli  wrote:
>
> Hi Fernando,
>
> I used to have a cluster with ~300 tables (1 keyspace) on C* 2.0, it was a
> real pain in terms of operations. Repairs were terribly slow, boot of C*
> slowed down and in general tracking table metrics becomes bit more work.
> Why do you need this high number of tables?
>
> Tommaso
>
> On Tue, Mar 1, 2016 at 9:16 AM, Fernando Jimenez <
> fernando.jime...@wealth-port.com> wrote:
>
>> Hi Jack
>>
>> By entry I mean row
>>
>> Apologies for the “obsolete terminology”. When I first looked at
>> Cassandra it was still on CQL2, and now that I’m looking at it again I’ve
>> defaulted to the terms I already knew. I will bear it in mind and call them
>> tables from now on.
>>
>> Is there any documentation about this limit? for example, I’d be keen to
>> know how much memory is consumed per table, and I’m also curious about the
>> reasons for keeping this in memory. I’m trying to understand the
>> limitations here, rather than challenge them.
>>
>> So far I found nothing in my search, hence why I had to resort to some
>> “load testing” to see what happens when you push the table count high
>>
>> Thanks
>> FJ
>>
>>
>> On 01 Mar 2016, at 06:23, Jack Krupansky 
>> wrote:
>>
>> 3,000 entries? What's an "entry"? Do you mean row, column, or... what?
>>
>> You are using the obsolete terminology of CQL2 and Thrift - column
>> family. With CQL3 you should be creating "tables". The practical
>> recommendation of an upper limit of a few hundred tables across all key
>> spaces remains.
>>
>> Technically you can go higher and technically you can reduce the overhead
>> per table (an undocumented Jira - intentionally undocumented since it is
>> strongly not recommended), but... it is unlikely that you will be happy
>> with the results.
>>
>> What is the nature of the use case?
>>
>> You basically have two choices: an additional cluster column to
>> distinguish categories of table, or separate clusters for each few hundred
>> of tables.
>>
>>
>> -- Jack Krupansky
>>
>> On Mon, Feb 29, 2016 at 12:30 PM, Fernando Jimenez <
>> fernando.jime...@wealth-port.com> wrote:
>>
>>> Hi all
>>>
>>> I have a use case for Cassandra that would require creating a large
>>> number of column families. I have found references to early versions of
>>> Cassandra where each column family would require a fixed amount of memory
>>> on all nodes, effectively imposing an upper limit on the total number of
>>> CFs. I have also seen rumblings that this may have been fixed in later
>>> versions.
>>>
>>> To put the question to rest, I have setup a DSE sandbox and created some
>>> code to generate column families populated with 3,000 entries each.
>>>
>>> Unfortunately I have now hit this issue:
>>> https://issues.apache.org/jira/browse/CASSANDRA-9291
>>>
>>> So I will have to retest against Cassandra 3.0 instead
>>>
>>> However, I would like to understand the limitations regarding creation
>>> of column families.
>>>
>>> * Is there a practical upper limit?
>>> * is this a fixed limit, or does it scale as more nodes are added into
>>> the cluster?
>>> * Is there a difference between one keyspace with thousands of column
>>> families, vs thousands of keyspaces with only a few column families each?
>>>
>>> I haven’t found any hard evidence/documentation to help me here, but if
>>> you can point me in the right direction, I will oblige and RTFM away.
>>>
>>> Many thanks for your help!
>>>
>>> Cheers
>>> FJ
>>>
>>>
>>>
>>
>>
>
>


Commit log size vs memtable total size

2016-03-01 Thread Vlad
Hi, there are the following parameters in cassandra.yaml:

memtable_total_space_in_mb (1/4 of heap, e.g. 512MB) - Specifies the total 
memory used for all memtables on a node.

commitlog_total_space_in_mb (8GB) - Total space used for commit logs. If the 
used space goes above this value, Cassandra rounds up to the next nearest 
segment multiple and flushes memtables to disk for the oldest commit log 
segments, removing those log segments.

My question is: what is the meaning of the commit log size being much larger 
than the total memtable size?

From the manual: "Cassandra flushes memtables to disk, creating SSTables when 
the commit log space threshold has been exceeded." But as far as I understand, 
memtables are also flushed when too much memtable space is used, and in any 
case the unflushed data can't exceed the total memtable size in memory. So the 
commit log can't hold more than the memtable size; why is there a difference 
between the commit log and memtable size limits?

Regards, Vlad




Re: Practical limit on number of column families

2016-03-01 Thread Fernando Jimenez
Hi Tommaso

It’s not that I _need_ a large number of tables. This approach maps easily to 
the problem we are trying to solve, but it’s becoming clear it’s not the right 
approach.
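
For concreteness, the single-table alternative suggested earlier in the thread 
(one physical table distinguished by an extra key column) could look roughly 
like the sketch below, using the DataStax Python driver. The keyspace, table, 
and column names are invented for illustration:

from cassandra.cluster import Cluster

# Hypothetical sketch: one physical table carries rows for many logical
# "tables", with the logical table name folded into the partition key.
cluster = Cluster(['127.0.0.1'])
session = cluster.connect('sandbox')  # keyspace name is invented

session.execute("""
    CREATE TABLE IF NOT EXISTS multiplexed (
        logical_table text,
        entry_id      text,
        payload       text,
        PRIMARY KEY ((logical_table, entry_id))
    )
""")

# Reads and writes always scope to one logical table through the key.
session.execute(
    "INSERT INTO multiplexed (logical_table, entry_id, payload) "
    "VALUES (%s, %s, %s)",
    ('logical_table_0001', 'entry-0001', 'some value'))

Jack suggested a clustering column for the same purpose; either way the idea 
is one physical table instead of thousands.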

At the moment I’m trying to understand the limitations in Cassandra regarding 
the number of tables, and the reasons behind them. I’ve come to the mailing 
list as my Google-fu is not giving me what I’m looking for :(

FJ



> On 01 Mar 2016, at 09:36, tommaso barbugli wrote:
> 
> Hi Fernando,
> 
> I used to have a cluster with ~300 tables (1 keyspace) on C* 2.0; it was a 
> real pain in terms of operations. Repairs were terribly slow, boot of C* 
> slowed down, and in general tracking table metrics became a bit more work. 
> Why do you need this high number of tables?
> 
> Tommaso
> 
> On Tue, Mar 1, 2016 at 9:16 AM, Fernando Jimenez wrote:
> Hi Jack
> 
> By entry I mean row
> 
> Apologies for the “obsolete terminology”. When I first looked at Cassandra it 
> was still on CQL2, and now that I’m looking at it again I’ve defaulted to the 
> terms I already knew. I will bear it in mind and call them tables from now on.
> 
> Is there any documentation about this limit? For example, I’d be keen to know 
> how much memory is consumed per table, and I’m also curious about the reasons 
> for keeping this in memory. I’m trying to understand the limitations here, 
> rather than challenge them.
> 
> So far I have found nothing in my search, hence the “load testing” to see 
> what happens when you push the table count high.
> 
> Thanks
> FJ
> 
> 
>> On 01 Mar 2016, at 06:23, Jack Krupansky wrote:
>> 
>> 3,000 entries? What's an "entry"? Do you mean row, column, or... what?
>> 
>> You are using the obsolete terminology of CQL2 and Thrift - column family. 
>> With CQL3 you should be creating "tables". The practical recommendation of 
>> an upper limit of a few hundred tables across all keyspaces remains.
>> 
>> Technically you can go higher and technically you can reduce the overhead 
>> per table (an undocumented Jira - intentionally undocumented since it is 
>> strongly not recommended), but... it is unlikely that you will be happy with 
>> the results.
>> 
>> What is the nature of the use case?
>> 
>> You basically have two choices: an additional cluster column to distinguish 
>> categories of table, or separate clusters for each few hundred tables.
>> 
>> 
>> -- Jack Krupansky
>> 
>> On Mon, Feb 29, 2016 at 12:30 PM, Fernando Jimenez wrote:
>> Hi all
>> 
>> I have a use case for Cassandra that would require creating a large number 
>> of column families. I have found references to early versions of Cassandra 
>> where each column family would require a fixed amount of memory on all 
>> nodes, effectively imposing an upper limit on the total number of CFs. I 
>> have also seen rumblings that this may have been fixed in later versions.
>> 
>> To put the question to rest, I have set up a DSE sandbox and created some 
>> code to generate column families populated with 3,000 entries each.
>> 
>> Unfortunately I have now hit this issue: 
>> https://issues.apache.org/jira/browse/CASSANDRA-9291 
>> 
>> 
>> So I will have to retest against Cassandra 3.0 instead
>> 
>> However, I would like to understand the limitations regarding creation of 
>> column families. 
>> 
>>  * Is there a practical upper limit? 
>>  * Is this a fixed limit, or does it scale as more nodes are added into 
>> the cluster? 
>>  * Is there a difference between one keyspace with thousands of column 
>> families, vs thousands of keyspaces with only a few column families each?
>> 
>> I haven’t found any hard evidence/documentation to help me here, but if you 
>> can point me in the right direction, I will oblige and RTFM away.
>> 
>> Many thanks for your help!
>> 
>> Cheers
>> FJ
>> 
>> 
>> 
> 
> 



Re: Practical limit on number of column families

2016-03-01 Thread tommaso barbugli
Hi Fernando,

I used to have a cluster with ~300 tables (1 keyspace) on C* 2.0; it was a
real pain in terms of operations. Repairs were terribly slow, boot of C*
slowed down, and in general tracking table metrics became a bit more work.
Why do you need this high number of tables?

Tommaso

On Tue, Mar 1, 2016 at 9:16 AM, Fernando Jimenez <
fernando.jime...@wealth-port.com> wrote:

> Hi Jack
>
> By entry I mean row
>
> Apologies for the “obsolete terminology”. When I first looked at Cassandra
> it was still on CQL2, and now that I’m looking at it again I’ve defaulted
> to the terms I already knew. I will bear it in mind and call them tables
> from now on.
>
> Is there any documentation about this limit? For example, I’d be keen to
> know how much memory is consumed per table, and I’m also curious about the
> reasons for keeping this in memory. I’m trying to understand the
> limitations here, rather than challenge them.
>
> So far I have found nothing in my search, hence the “load testing” to see
> what happens when you push the table count high.
>
> Thanks
> FJ
>
>
> On 01 Mar 2016, at 06:23, Jack Krupansky wrote:
>
> 3,000 entries? What's an "entry"? Do you mean row, column, or... what?
>
> You are using the obsolete terminology of CQL2 and Thrift - column family.
> With CQL3 you should be creating "tables". The practical recommendation of
> an upper limit of a few hundred tables across all keyspaces remains.
>
> Technically you can go higher and technically you can reduce the overhead
> per table (an undocumented Jira - intentionally undocumented since it is
> strongly not recommended), but... it is unlikely that you will be happy
> with the results.
>
> What is the nature of the use case?
>
> You basically have two choices: an additional cluster column to
> distinguish categories of table, or separate clusters for each few hundred
> tables.
>
>
> -- Jack Krupansky
>
> On Mon, Feb 29, 2016 at 12:30 PM, Fernando Jimenez <
> fernando.jime...@wealth-port.com> wrote:
>
>> Hi all
>>
>> I have a use case for Cassandra that would require creating a large
>> number of column families. I have found references to early versions of
>> Cassandra where each column family would require a fixed amount of memory
>> on all nodes, effectively imposing an upper limit on the total number of
>> CFs. I have also seen rumblings that this may have been fixed in later
>> versions.
>>
>> To put the question to rest, I have set up a DSE sandbox and created some
>> code to generate column families populated with 3,000 entries each.
>>
>> Unfortunately I have now hit this issue:
>> https://issues.apache.org/jira/browse/CASSANDRA-9291
>>
>> So I will have to retest against Cassandra 3.0 instead
>>
>> However, I would like to understand the limitations regarding creation of
>> column families.
>>
>> * Is there a practical upper limit?
>> * Is this a fixed limit, or does it scale as more nodes are added into
>> the cluster?
>> * Is there a difference between one keyspace with thousands of column
>> families, vs thousands of keyspaces with only a few column families each?
>>
>> I haven’t found any hard evidence/documentation to help me here, but if
>> you can point me in the right direction, I will oblige and RTFM away.
>>
>> Many thanks for your help!
>>
>> Cheers
>> FJ
>>
>>
>>
>
>


Re: Practical limit on number of column families

2016-03-01 Thread Fernando Jimenez
Hi Jack

By entry I mean row

Apologies for the “obsolete terminology”. When I first looked at Cassandra it 
was still on CQL2, and now that I’m looking at it again I’ve defaulted to the 
terms I already knew. I will bear it in mind and call them tables from now on.

Is there any documentation about this limit? For example, I’d be keen to know 
how much memory is consumed per table, and I’m also curious about the reasons 
for keeping this in memory. I’m trying to understand the limitations here, 
rather than challenge them.

So far I have found nothing in my search, hence the “load testing” to see what 
happens when you push the table count high.
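
A load test along those lines could be reproduced with something like this 
sketch (DataStax Python driver; the contact point, keyspace, and table naming 
are invented, and this is not the original code):

from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect()
session.execute(
    "CREATE KEYSPACE IF NOT EXISTS loadtest WITH replication = "
    "{'class': 'SimpleStrategy', 'replication_factor': 1}")

# Create many tables, each populated with 3,000 rows, and watch node
# memory and startup behaviour as the table count climbs.
for t in range(1000):
    table = 'loadtest.cf_%04d' % t
    session.execute(
        "CREATE TABLE IF NOT EXISTS %s "
        "(id int PRIMARY KEY, val text)" % table)
    insert = session.prepare(
        "INSERT INTO %s (id, val) VALUES (?, ?)" % table)
    for i in range(3000):
        session.execute(insert, (i, 'entry-%d' % i))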

Thanks
FJ


> On 01 Mar 2016, at 06:23, Jack Krupansky wrote:
> 
> 3,000 entries? What's an "entry"? Do you mean row, column, or... what?
> 
> You are using the obsolete terminology of CQL2 and Thrift - column family. 
> With CQL3 you should be creating "tables". The practical recommendation of an 
> upper limit of a few hundred tables across all keyspaces remains.
> 
> Technically you can go higher and technically you can reduce the overhead per 
> table (an undocumented Jira - intentionally undocumented since it is strongly 
> not recommended), but... it is unlikely that you will be happy with the 
> results.
> 
> What is the nature of the use case?
> 
> You basically have two choices: an additional cluster column to distinguish 
> categories of table, or separate clusters for each few hundred tables.
> 
> 
> -- Jack Krupansky
> 
> On Mon, Feb 29, 2016 at 12:30 PM, Fernando Jimenez wrote:
> Hi all
> 
> I have a use case for Cassandra that would require creating a large number of 
> column families. I have found references to early versions of Cassandra where 
> each column family would require a fixed amount of memory on all nodes, 
> effectively imposing an upper limit on the total number of CFs. I have also 
> seen rumblings that this may have been fixed in later versions.
> 
> To put the question to rest, I have set up a DSE sandbox and created some code 
> to generate column families populated with 3,000 entries each.
> 
> Unfortunately I have now hit this issue: 
> https://issues.apache.org/jira/browse/CASSANDRA-9291 
> 
> 
> So I will have to retest against Cassandra 3.0 instead
> 
> However, I would like to understand the limitations regarding creation of 
> column families. 
> 
>   * Is there a practical upper limit? 
>   * Is this a fixed limit, or does it scale as more nodes are added into 
> the cluster? 
>   * Is there a difference between one keyspace with thousands of column 
> families, vs thousands of keyspaces with only a few column families each?
> 
> I haven’t found any hard evidence/documentation to help me here, but if you 
> can point me in the right direction, I will oblige and RTFM away.
> 
> Many thanks for your help!
> 
> Cheers
> FJ
> 
> 
>