Re: Cassandra Needs to Grow Up by Version Five!

2018-02-19 Thread James Briggs
Kenneth:
What you said is not wrong.

Vertica and Riak are examples of distributed databases that don't require 
hand-holding.

Cassandra is for Java-programmer DIYers, or more often Datastax clients, at 
this point.
Thanks, James.

  From: Kenneth Brotman 
 To: user@cassandra.apache.org 
Cc: d...@cassandra.apache.org
 Sent: Monday, February 19, 2018 4:56 PM
 Subject: RE: Cassandra Needs to Grow Up by Version Five!
   
Jeff, you helped me figure out what I was missing.  It just took me a day to 
digest what you wrote.  I’m coming over from another type of engineering.  I 
didn’t know and it’s not really documented: Cassandra runs in a data center.  
Nowadays that means the nodes are going to be in managed containers, Docker 
containers, managed by Kubernetes, Mesos or something, and for that reason 
anyone operating Cassandra in a real-world setting would not encounter the 
issues I raised in the way I described.

Shouldn’t the architectural diagrams people reference indicate that in some 
way?  That would have helped me.

Kenneth Brotman

From: Kenneth Brotman 
[mailto:kenbrot...@yahoo.com] 
Sent: Monday, February 19, 2018 10:43 AM
To: 'user@cassandra.apache.org'
Cc: 'd...@cassandra.apache.org'
Subject: RE: Cassandra Needs to Grow Up by Version Five!

Well said.  Very fair.  I wouldn’t mind hearing from others still.  You’re a 
good guy!

Kenneth Brotman

From: Jeff Jirsa [mailto:jji...@gmail.com] 
Sent: Monday, February 19, 2018 9:10 AM
To: cassandra
Cc: Cassandra DEV
Subject: Re: Cassandra Needs to Grow Up by Version Five!

There's a lot of things below I disagree with, but it's ok. I convinced myself 
not to nit-pick every point.

https://issues.apache.org/jira/browse/CASSANDRA-13971 has some of Stefan's 
work with cert management.

Beyond that, I encourage you to do what Michael suggested: open JIRAs for 
things you care strongly about, and work on them if you have time. Sometime 
this year we'll schedule an NGCC (Next Generation Cassandra Conference) where 
we talk about future project work and direction. I encourage you to attend if 
you're able (I encourage anyone who cares about the direction of Cassandra to 
attend; it'll probably be either free or very low cost, just to cover a venue 
and some food). If nothing else, you'll meet some of the teams who are working 
on the project and learn why they've selected the projects on which they're 
working. You'll have an opportunity to pitch your vision, and maybe you can 
talk some folks into helping out.

- Jeff

On 
Mon, Feb 19, 2018 at 1:01 AM, Kenneth Brotman  
wrote:Comments inline

>-Original Message-
>From: Jeff Jirsa [mailto:jji...@gmail.com]
>Sent: Sunday, February 18, 2018 10:58 PM
>To: user@cassandra.apache.org
>Cc: d...@cassandra.apache.org
>Subject: Re: Cassandra Needs to Grow Up by Version Five!
>
>Comments inline
>
>
>> On Feb 18, 2018, at 9:39 PM, Kenneth Brotman  
>> wrote:
>>
> >Cassandra feels like an unfinished program to me. The problem is not that 
> >it’s open source or cutting edge.  It’s an open source cutting edge program 
> >that lacks some of its basic functionality.  We are all stuck addressing 
> >fundamental mechanical tasks for Cassandra because the basic code that would 
> >do that part has not been contributed yet.
>>
>There’s probably 2-3 reasons why here:
>
>1) Historically the pmc has tried to keep the scope of the project very 
>narrow. It’s a database. We don’t ship drivers. We don’t ship developer tools. 
>We don’t ship fancy UIs. We ship a database. I think for the most part the 
>narrow vision has been for the best, but maybe it’s time to reconsider some of 
>the scope.
>
>Postgres will autovacuum to prevent wraparound (hopefully),  but everyone I 
>know running 

Re: newbie , to use cassandra when query is arbitrary?

2018-02-19 Thread Rajesh Kishore
Hi Rahul,

I cannot confirm the size wrt Cassandra, but usually in Berkeley DB, for 10M
records, it takes around 120 GB. Any operation takes hardly 2 to 3 ms when
the query is performed on an indexed attribute.

Usually 10 to 12 columns are the OOTB behaviour, but one can configure any
attribute to be indexed on the fly. The main issue is: what should be the
strategy to partition the records if your query is not fixed?


Regards,
Rajesh
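
The mismatch being discussed here, between arbitrary predicates and Cassandra's partition-key-driven data model, can be sketched in CQL. Table, column, and index names below are hypothetical, loosely based on the Entry entity quoted later in the thread:

```sql
-- Hypothetical table: efficient only for queries that supply the
-- partition key (id).
CREATE TABLE entries (
    id            text PRIMARY KEY,
    objectclasses list<text>,
    sn            text,
    cn            text
);

-- A secondary index allows filtering on sn, but each such query fans
-- out to every node, which scales poorly at billions of rows.
CREATE INDEX entries_sn_idx ON entries (sn);

-- Fast: single-partition lookup by key.
SELECT * FROM entries WHERE id = 'abc';

-- Possible via the index, but a cluster-wide scatter/gather.
SELECT * FROM entries WHERE sn = 'A';

-- Not supported natively: OR predicates and wildcard matches such as
-- sn = '*'; this is where Elassandra / Solr integration comes in.
```

Elassandra and DSE Search side-step this limitation by maintaining a Lucene index alongside each node, which is why they keep coming up in this thread for arbitrary-column queries.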

On Tue, Feb 20, 2018 at 2:09 AM, Rahul Singh 
wrote:

> What is the data size in TB/GB, and what is the operations per second
> for read and write?
> Cassandra is both for high volume and high velocity for read and write.
>
> How many of the columns need to be indexed? You may find that doing a
> secondary index is helpful, or look to Elassandra / DSE Solr if your
> queries need to be on arbitrary columns across those hundred.
>
> --
> Rahul Singh
> rahul.si...@anant.us
>
> Anant Corporation
>
> On Feb 19, 2018, 11:31 AM -0500, Rajesh Kishore ,
> wrote:
>
> It can be a minimum of 20 million up to 10 billion.
>
> Each entry can contain up to 100 columns.
>
> Rajesh
>
> On 19 Feb 2018 9:02 p.m., "Rahul Singh" 
> wrote:
>
> How much data do you need to store, and what is the frequency of reads and
> writes?
>
> --
> Rahul Singh
> rahul.si...@anant.us
>
> Anant Corporation
>
> On Feb 19, 2018, 3:44 AM -0500, Rajesh Kishore ,
> wrote:
>
> Hi All,
>
> I am a newbie to the Cassandra world, and have got some understanding of the product.
> I have an application (which is a kind of datastore) for other applications;
> the user queries are not fixed, i.e. the queries can come with any attributes.
> In this case, is it recommended to use Cassandra? What benefits can we
> get?
>
> Background - The application is currently using Berkeley DB for maintaining
> entries; we are trying to evaluate whether another backend can fit the
> requirement we have.
>
> Now, if we want to use Cassandra, I broadly see one table which would
> contain all the entries. Now, the question is: what should be the correct
> partitioning measures?
> entity is
> Entry {
> id varchar,
> objectclasses list
> sn
> cn
> ...
> ...
> }
>
> and query can be anything like
> a) get all entries based on sn=*
> b) get all entries based on sn=A and cn=b
> c) get all entries based on sn=A OR objectclass contains person
> ..
> 
>
> Please advise.
>
> Thanks,
> Rajesh
>
>
>


RE: Cassandra Needs to Grow Up by Version Five!

2018-02-19 Thread Kenneth Brotman
Jeff, you helped me figure out what I was missing.  It just took me a day to 
digest what you wrote.  I’m coming over from another type of engineering.  I 
didn’t know and it’s not really documented: Cassandra runs in a data center.  
Nowadays that means the nodes are going to be in managed containers, Docker 
containers, managed by Kubernetes, Mesos or something, and for that reason 
anyone operating Cassandra in a real-world setting would not encounter the 
issues I raised in the way I described.

 

Shouldn’t the architectural diagrams people reference indicate that in some 
way?  That would have helped me.

 

Kenneth Brotman

 

From: Kenneth Brotman [mailto:kenbrot...@yahoo.com] 
Sent: Monday, February 19, 2018 10:43 AM
To: 'user@cassandra.apache.org'
Cc: 'd...@cassandra.apache.org'
Subject: RE: Cassandra Needs to Grow Up by Version Five!

 

Well said.  Very fair.  I wouldn’t mind hearing from others still.  You’re a 
good guy!

 

Kenneth Brotman

 

From: Jeff Jirsa [mailto:jji...@gmail.com] 
Sent: Monday, February 19, 2018 9:10 AM
To: cassandra
Cc: Cassandra DEV
Subject: Re: Cassandra Needs to Grow Up by Version Five!

 

There's a lot of things below I disagree with, but it's ok. I convinced myself 
not to nit-pick every point.

 

https://issues.apache.org/jira/browse/CASSANDRA-13971 has some of Stefan's work 
with cert management

 

Beyond that, I encourage you to do what Michael suggested: open JIRAs for 
things you care strongly about, work on them if you have time. Sometime this 
year we'll schedule a NGCC (Next Generation Cassandra Conference) where we talk 
about future project work and direction, I encourage you to attend if you're 
able (I encourage anyone who cares about the direction of Cassandra to attend, 
it'll probably be either free or very low cost, just to cover a venue and some 
food). If nothing else, you'll meet some of the teams who are working on the 
project, and learn why they've selected the projects on which they're working. 
You'll have an opportunity to pitch your vision, and maybe you can talk some 
folks into helping out. 

 

- Jeff

 

 

 

 

On Mon, Feb 19, 2018 at 1:01 AM, Kenneth Brotman  
wrote:

Comments inline

>-Original Message-
>From: Jeff Jirsa [mailto:jji...@gmail.com]
>Sent: Sunday, February 18, 2018 10:58 PM
>To: user@cassandra.apache.org
>Cc: d...@cassandra.apache.org
>Subject: Re: Cassandra Needs to Grow Up by Version Five!
>
>Comments inline
>
>
>> On Feb 18, 2018, at 9:39 PM, Kenneth Brotman  
>> wrote:
>>
> >Cassandra feels like an unfinished program to me. The problem is not that 
> >it’s open source or cutting edge.  It’s an open source cutting edge program 
> >that lacks some of its basic functionality.  We are all stuck addressing 
> >fundamental mechanical tasks for Cassandra because the basic code that would 
> >do that part has not been contributed yet.
>>
>There’s probably 2-3 reasons why here:
>
>1) Historically the pmc has tried to keep the scope of the project very 
>narrow. It’s a database. We don’t ship drivers. We don’t ship developer tools. 
>We don’t ship fancy UIs. We ship a database. I think for the most part the 
>narrow vision has been for the best, but maybe it’s time to reconsider some of 
>the scope.
>
>Postgres will autovacuum to prevent wraparound (hopefully),  but everyone I 
>know running Postgres uses flexible-freeze in cron - sometimes it’s ok to let 
>the database have its opinions and let third party tools fill in the gaps.
>

I can appreciate the desire to stay in scope.  I believe usability is king.  
When users have to learn the database, then learn what they have to automate, 
then learn an automation tool, and then use the automation tool to do something 
as fundamental as the tasks I described, then something is missing from the 
database itself that is adversely affecting usability - and that is very bad.  
Where those big companies need to calculate the ROI is in the cost of acquiring 
or training the next group of users.  Consider how steep the learning curve is 
for new users.  Consider the business case for improving ease of use.

>2) Cassandra is, by definition, a database for large scale problems. Most of 
>the companies working on/with it tend to be big companies. Big companies often 
>have pre-existing automation that solved the stuff you consider fundamental 
>tasks, so there’s probably nobody actively working on the solved problems that 
>you may consider missing features - for many people they’re already solved.
>

I could be wrong but it sounds like a lot of the code work is done, and if the 
companies would take the time to contribute more code, then the rest of the 
code needed could be generated easily.

>3) It’s not nearly as basic as you think it is. Datastax seemingly had a 
>multi-person team on opscenter, and while it was better than anything else 
>around last time I used it (before it stopped 

Re: Right sizing Cassandra data nodes

2018-02-19 Thread Charulata Sharma (charshar)
Thanks for the response Rahul. I did not understand the “node density” point.

Charu

From: Rahul Singh 
Reply-To: "user@cassandra.apache.org" 
Date: Monday, February 19, 2018 at 12:32 PM
To: "user@cassandra.apache.org" 
Subject: Re: Right sizing Cassandra data nodes

1. I would keep OpsCenter on a different cluster. Why unnecessarily put traffic 
and computing for OpsCenter data on a real business data cluster?
2. Don’t put more than 1-2 TB per node. Maybe 3 TB. Node density, as it 
increases, creates more replication, read repairs, etc., and memory usage for 
doing the compactions.
3. You can have as much as you want for snapshots as long as you keep them on 
another disk, or even move them to a SAN / NAS. All you may care about is the 
most recent snapshot on the physical machine / disks on a live node.
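
A concrete way to read the node-density point in item 2: the denser each node, the longer any full-node streaming operation (bootstrap, node replacement, rebuild) takes. A rough back-of-the-envelope sketch, where the 50 MB/s streaming rate is purely an assumed illustrative figure:

```python
def streaming_hours(node_density_tb, stream_rate_mb_s=50):
    """Rough time to re-stream a node's full data set at a given rate.

    node_density_tb: data stored per node, in TB (decimal units).
    stream_rate_mb_s: assumed sustained streaming throughput in MB/s.
    """
    total_mb = node_density_tb * 1_000_000   # TB -> MB
    seconds = total_mb / stream_rate_mb_s
    return seconds / 3600                    # seconds -> hours

# Rebuilding a 2 TB node at an assumed 50 MB/s takes ~11 hours;
# a 6 TB node takes ~33 hours, tripling the window of reduced redundancy.
print(round(streaming_hours(2), 1))  # -> 11.1
print(round(streaming_hours(6), 1))  # -> 33.3
```

The same arithmetic applies to repair and compaction backlogs, which is why the 1-2 TB per node guidance above keeps recovery windows manageable.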

--
Rahul Singh
rahul.si...@anant.us

Anant Corporation

On Feb 19, 2018, 3:08 PM -0500, Charulata Sharma (charshar) 
, wrote:

Hi All,

Looking for some insight into how application data archive and purge is carried 
out for a C* database. Are there standard guidelines on calculating the amount 
of space that can be used for storing data on a specific node?

Some pointers that I got while researching are;


-  Allocate 50% space for compaction, e.g. if data size is 50GB then 
allocate 25GB for compaction.

-  Snapshot strategy. If old snapshots are present, then they occupy 
the disk space.

-  Allocate some percentage of storage (  ) for system tables and 
OpsCenter tables ?
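
The 50% compaction headroom rule in the first pointer can be turned into simple arithmetic; the function and the snapshot figure below are illustrative assumptions, not official guidance:

```python
def usable_data_gb(disk_gb, snapshot_overhead_gb=0, compaction_headroom=0.5):
    """Max live data a node can hold if a fraction of the remaining disk
    must stay free for compaction (worst case for size-tiered compaction,
    where a full rewrite can temporarily double a table's footprint)."""
    return (disk_gb - snapshot_overhead_gb) * compaction_headroom

# A 1000 GB data disk with 100 GB of retained snapshots leaves roughly
# 450 GB for live data under the 50% rule.
print(usable_data_gb(1000, snapshot_overhead_gb=100))  # -> 450.0
```

This is the worst-case sizing; leveled compaction needs much less headroom, so the right fraction depends on the compaction strategy in use.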

We have a scenario where certain transaction data needs to be archived based on 
business rules and some purged, so before deciding on an A strategy, I am 
trying to analyze
how much transactional data can be stored given the current node capacity. I 
also found out that the space-available metric shown in OpsCenter is not very 
reliable because it doesn’t show
the snapshot space. In our case, we have a huge snapshot size. For some 
unexplained reason, we seem to be taking snapshots of our data every hour and 
purging them only after 7 days.


Thanks,
Charu
Cisco Systems.





Re: newbie , to use cassandra when query is arbitrary?

2018-02-19 Thread Rahul Singh
What is the data size in TB/GB, and what is the operations per second for 
read and write?
Cassandra is both for high volume and high velocity for read and write.

How many of the columns need to be indexed? You may find that doing a secondary 
index is helpful, or look to Elassandra / DSE Solr if your queries need to be 
on arbitrary columns across those hundred.

--
Rahul Singh
rahul.si...@anant.us

Anant Corporation

On Feb 19, 2018, 11:31 AM -0500, Rajesh Kishore , 
wrote:
> It can be a minimum of 20 million up to 10 billion.
>
> Each entry can contain up to 100 columns.
>
> Rajesh
>
> > On 19 Feb 2018 9:02 p.m., "Rahul Singh"  
> > wrote:
> > > How much data do you need to store, and what is the frequency of reads and
> > > writes?
> > >
> > > --
> > > Rahul Singh
> > > rahul.si...@anant.us
> > >
> > > Anant Corporation
> > >
> > > On Feb 19, 2018, 3:44 AM -0500, Rajesh Kishore , 
> > > wrote:
> > > > Hi All,
> > > >
> > > > I am a newbie to the Cassandra world, and have got some understanding of the product.
> > > > I have an application (which is a kind of datastore) for other
> > > > applications; the user queries are not fixed, i.e. the queries can come
> > > > with any attributes.
> > > > In this case, is it recommended to use Cassandra? What benefits can we
> > > > get?
> > > >
> > > > Background - The application is currently using Berkeley DB for
> > > > maintaining entries; we are trying to evaluate whether another backend
> > > > can fit the requirement we have.
> > > >
> > > > Now, if we want to use Cassandra, I broadly see one table which would
> > > > contain all the entries. Now, the question is: what should be the
> > > > correct partitioning measures?
> > > > entity is
> > > > Entry {
> > > > id varchar,
> > > > objectclasses list
> > > > sn
> > > > cn
> > > > ...
> > > > ...
> > > > }
> > > >
> > > > and query can be anything like
> > > > a) get all entries based on sn=*
> > > > b) get all entries based on sn=A and cn=b
> > > > c) get all entries based on sn=A OR objectclass contains person
> > > > ..
> > > > 
> > > >
> > > > Please advise.
> > > >
> > > > Thanks,
> > > > Rajesh
>


Re: Right sizing Cassandra data nodes

2018-02-19 Thread Rahul Singh
1. I would keep OpsCenter on a different cluster. Why unnecessarily put traffic 
and computing for OpsCenter data on a real business data cluster?
2. Don’t put more than 1-2 TB per node. Maybe 3 TB. Node density, as it 
increases, creates more replication, read repairs, etc., and memory usage for 
doing the compactions.
3. You can have as much as you want for snapshots as long as you keep them on 
another disk, or even move them to a SAN / NAS. All you may care about is the 
most recent snapshot on the physical machine / disks on a live node.

--
Rahul Singh
rahul.si...@anant.us

Anant Corporation

On Feb 19, 2018, 3:08 PM -0500, Charulata Sharma (charshar) 
, wrote:
> Hi All,
>
> Looking for some insight into how application data archive and purge is 
> carried out for C* database. Are there standard guidelines on calculating the 
> amount of space that can be used for storing data in a specific node.
>
> Some pointers that I got while researching are;
>
> -  Allocate 50% space for compaction, e.g. if data size is 50GB then 
> allocate 25GB for compaction.
> -  Snapshot strategy. If old snapshots are present, then they occupy 
> the disk space.
> -  Allocate some percentage of storage (  ) for system tables and 
> OpsCenter tables ?
>
> We have a scenario where certain transaction data needs to be archived based 
> on business rules and some purged, so before deciding on an A strategy, I 
> am trying to analyze
> how much transactional data can be stored given the current node capacity. I 
> also found out that the space available metric shown in Opscenter is not very 
> reliable because it doesn’t show
> the snapshot space. In our case, we have a huge snapshot size. For some 
> unexplained reason, we seem to be taking snapshots of our data every hour and 
> purging them only after 7 days.
>
>
> Thanks,
> Charu
> Cisco Systems.
>
>
>


Re: SSTableLoader Question

2018-02-19 Thread shalom sagges
Sounds good.

Thanks for the explanation!

On Sun, Feb 18, 2018 at 5:15 PM, Rahul Singh 
wrote:

> If you don’t have access to the file, you don’t have access to the file.
> I’ve seen this issue several times. It’s the easiest low-hanging fruit to
> resolve, so figure it out: make sure ownership is cassandra:cassandra from
> root down to the data folder, and either run as root or sudo it.
>
> If it’s compacted it won’t be there, so you won’t have the file. I’m not
> aware of this event being communicated to sstableloader via SEDA. Besides,
> the sstable that you are loading SHOULD not be live. If you are streaming a
> live sstable, it means you are using sstableloader not as it is designed to
> be used - which is with static files.
>
> --
> Rahul Singh
> rahul.si...@anant.us
>
> Anant Corporation
>
> On Feb 18, 2018, 9:22 AM -0500, shalom sagges ,
> wrote:
>
Not really sure with which user I ran it (root or cassandra), although I
don't understand why a permission issue would generate a File Not Found
exception.
>
And in general, what if a file is being streamed and gets compacted before
the streaming ends? Does Cassandra know how to handle this?
>
> Thanks!
>
> On Sun, Feb 18, 2018 at 3:58 PM, Rahul Singh  > wrote:
>
>> Check permissions maybe? Who owns the files vs. who is running
>> sstableloader.
>>
>> --
>> Rahul Singh
>> rahul.si...@anant.us
>>
>> Anant Corporation
>>
>> On Feb 18, 2018, 4:26 AM -0500, shalom sagges ,
>> wrote:
>>
>> Hi All,
>>
>> C* version 2.0.14.
>>
>> I was loading some data to another cluster using SSTableLoader. The
>> streaming failed with the following error:
>>
>>
>> Streaming error occurred
>> java.lang.RuntimeException: java.io.FileNotFoundException:
>> /data1/keyspace1/table1/keyspace1-table1-jb-65174-Data.db (No such file or directory)
>> at org.apache.cassandra.io.compress.CompressedRandomAccessReader.open(CompressedRandomAccessReader.java:59)
>> at org.apache.cassandra.io.sstable.SSTableReader.openDataReader(SSTableReader.java:1409)
>> at org.apache.cassandra.streaming.compress.CompressedStreamWriter.write(CompressedStreamWriter.java:55)
>> at org.apache.cassandra.streaming.messages.OutgoingFileMessage$1.serialize(OutgoingFileMessage.java:59)
>> at org.apache.cassandra.streaming.messages.OutgoingFileMessage$1.serialize(OutgoingFileMessage.java:42)
>> at org.apache.cassandra.streaming.messages.StreamMessage.serialize(StreamMessage.java:45)
>> at org.apache.cassandra.streaming.ConnectionHandler$OutgoingMessageHandler.sendMessage(ConnectionHandler.java:339)
>> at org.apache.cassandra.streaming.ConnectionHandler$OutgoingMessageHandler.run(ConnectionHandler.java:311)
>> at java.lang.Thread.run(Thread.java:722)
>> Caused by: java.io.FileNotFoundException:
>> /data1/keyspace1/table1/keyspace1-table1-jb-65174-Data.db (No such file or directory)
>> at java.io.RandomAccessFile.open(Native Method)
>> at java.io.RandomAccessFile.<init>(RandomAccessFile.java:233)
>> at org.apache.cassandra.io.util.RandomAccessReader.<init>(RandomAccessReader.java:58)
>> at org.apache.cassandra.io.compress.CompressedRandomAccessReader.<init>(CompressedRandomAccessReader.java:76)
>> at org.apache.cassandra.io.compress.CompressedRandomAccessReader.open(CompressedRandomAccessReader.java:55)
>> ... 8 more
>> WARN 18:31:35,938 [Stream #7243efb0-1262-11e8-8562-d19d5fe7829c] Stream
>> failed
>>
>>
>>
>> Did I miss something when running the load? Was the file suddenly missing
>> due to compaction?
>> If so, did I need to disable auto compaction or stop the service
>> beforehand? (didn't find any reference to compaction in the docs)
>>
>> I know it's an old version, but I didn't find any related bugs on "File
>> not found" exceptions.
>>
>> Thanks!
>>
>>
>>
>


Right sizing Cassandra data nodes

2018-02-19 Thread Charulata Sharma (charshar)
Hi All,

Looking for some insight into how application data archive and purge is carried 
out for a C* database. Are there standard guidelines on calculating the amount 
of space that can be used for storing data on a specific node?

Some pointers that I got while researching are;


-  Allocate 50% space for compaction, e.g. if data size is 50GB then 
allocate 25GB for compaction.

-  Snapshot strategy. If old snapshots are present, then they occupy 
the disk space.

-  Allocate some percentage of storage (  ) for system tables and 
OpsCenter tables ?

We have a scenario where certain transaction data needs to be archived based on 
business rules and some purged, so before deciding on an A strategy, I am 
trying to analyze
how much transactional data can be stored given the current node capacity. I 
also found out that the space-available metric shown in OpsCenter is not very 
reliable because it doesn’t show
the snapshot space. In our case, we have a huge snapshot size. For some 
unexplained reason, we seem to be taking snapshots of our data every hour and 
purging them only after 7 days.


Thanks,
Charu
Cisco Systems.





[RELEASE] Apache Cassandra 3.11.2 released - PLEASE READ NOTICE

2018-02-19 Thread Michael Shuler
PLEASE READ: MAXIMUM TTL EXPIRATION DATE NOTICE (CASSANDRA-14092)
--

The maximum expiration timestamp that can be represented by the storage
engine is 2038-01-19T03:14:06+00:00, which means that inserts with TTLs
that expire after this date are not currently supported. By default,
INSERTs with TTLs exceeding the maximum supported date are rejected, but
it's possible to choose a different expiration overflow policy. See
CASSANDRA-14092.txt for more details.

Prior to 3.0.16 (3.0.x) and 3.11.2 (3.11.x) there was no protection
against INSERTs with TTLs expiring after the maximum supported date,
causing the expiration time field to overflow and the records to expire
immediately. Clusters in the 2.X and lower series are not subject to
this when assertions are enabled. Backed-up SSTables can potentially be
recovered, and recovery instructions can be found in the
CASSANDRA-14092.txt file.

If you use or plan to use very large TTLs (10 to 20 years), read
CASSANDRA-14092.txt for more information.
--

The Cassandra team is pleased to announce the release of Apache
Cassandra version 3.11.2.

Apache Cassandra is a fully distributed database. It is the right choice
when you need scalability and high availability without compromising
performance.

 http://cassandra.apache.org/

Downloads of source and binary distributions are listed in our download
section:

 http://cassandra.apache.org/download/

This version is a bug fix release[1] on the 3.11 series. As always,
please pay attention to the release notes[2] and let us know[3] if you
encounter any problems.

Enjoy!

[1]: (CHANGES.txt) https://goo.gl/mQjYnb
[2]: (NEWS.txt) https://goo.gl/NJGdhu
[3]: https://issues.apache.org/jira/browse/CASSANDRA



signature.asc
Description: OpenPGP digital signature


[RELEASE] Apache Cassandra 3.0.16 released - PLEASE READ NOTICE

2018-02-19 Thread Michael Shuler
PLEASE READ: MAXIMUM TTL EXPIRATION DATE NOTICE (CASSANDRA-14092)
--

The maximum expiration timestamp that can be represented by the storage
engine is 2038-01-19T03:14:06+00:00, which means that inserts with TTLs
that expire after this date are not currently supported. By default,
INSERTs with TTLs exceeding the maximum supported date are rejected, but
it's possible to choose a different expiration overflow policy. See
CASSANDRA-14092.txt for more details.

Prior to 3.0.16 (3.0.x) and 3.11.2 (3.11.x) there was no protection
against INSERTs with TTLs expiring after the maximum supported date,
causing the expiration time field to overflow and the records to expire
immediately. Clusters in the 2.X and lower series are not subject to
this when assertions are enabled. Backed-up SSTables can potentially be
recovered, and recovery instructions can be found in the
CASSANDRA-14092.txt file.

If you use or plan to use very large TTLs (10 to 20 years), read
CASSANDRA-14092.txt for more information.
--

The Cassandra team is pleased to announce the release of Apache
Cassandra version 3.0.16.

Apache Cassandra is a fully distributed database. It is the right choice
when you need scalability and high availability without compromising
performance.

 http://cassandra.apache.org/

Downloads of source and binary distributions are listed in our download
section:

 http://cassandra.apache.org/download/

This version is a bug fix release[1] on the 3.0 series. As always,
please pay attention to the release notes[2] and let us know[3] if you
encounter any problems.

Enjoy!

[1]: (CHANGES.txt) https://goo.gl/ST7ij6
[2]: (NEWS.txt) https://goo.gl/Ek5hve
[3]: https://issues.apache.org/jira/browse/CASSANDRA



signature.asc
Description: OpenPGP digital signature


RE: Cassandra Needs to Grow Up by Version Five!

2018-02-19 Thread Kenneth Brotman
Well said.  Very fair.  I wouldn’t mind hearing from others still.  You’re a 
good guy!

 

Kenneth Brotman

 

From: Jeff Jirsa [mailto:jji...@gmail.com] 
Sent: Monday, February 19, 2018 9:10 AM
To: cassandra
Cc: Cassandra DEV
Subject: Re: Cassandra Needs to Grow Up by Version Five!

 

There's a lot of things below I disagree with, but it's ok. I convinced myself 
not to nit-pick every point.

 

https://issues.apache.org/jira/browse/CASSANDRA-13971 has some of Stefan's work 
with cert management

 

Beyond that, I encourage you to do what Michael suggested: open JIRAs for 
things you care strongly about, work on them if you have time. Sometime this 
year we'll schedule a NGCC (Next Generation Cassandra Conference) where we talk 
about future project work and direction, I encourage you to attend if you're 
able (I encourage anyone who cares about the direction of Cassandra to attend, 
it'll probably be either free or very low cost, just to cover a venue and some 
food). If nothing else, you'll meet some of the teams who are working on the 
project, and learn why they've selected the projects on which they're working. 
You'll have an opportunity to pitch your vision, and maybe you can talk some 
folks into helping out. 

 

- Jeff

 

 

 

 

On Mon, Feb 19, 2018 at 1:01 AM, Kenneth Brotman  
wrote:

Comments inline

>-Original Message-
>From: Jeff Jirsa [mailto:jji...@gmail.com]
>Sent: Sunday, February 18, 2018 10:58 PM
>To: user@cassandra.apache.org
>Cc: d...@cassandra.apache.org
>Subject: Re: Cassandra Needs to Grow Up by Version Five!
>
>Comments inline
>
>
>> On Feb 18, 2018, at 9:39 PM, Kenneth Brotman  
>> wrote:
>>
> >Cassandra feels like an unfinished program to me. The problem is not that 
> >it’s open source or cutting edge.  It’s an open source cutting edge program 
> >that lacks some of its basic functionality.  We are all stuck addressing 
> >fundamental mechanical tasks for Cassandra because the basic code that would 
> >do that part has not been contributed yet.
>>
>There’s probably 2-3 reasons why here:
>
>1) Historically the pmc has tried to keep the scope of the project very 
>narrow. It’s a database. We don’t ship drivers. We don’t ship developer tools. 
>We don’t ship fancy UIs. We ship a database. I think for the most part the 
>narrow vision has been for the best, but maybe it’s time to reconsider some of 
>the scope.
>
>Postgres will autovacuum to prevent wraparound (hopefully),  but everyone I 
>know running Postgres uses flexible-freeze in cron - sometimes it’s ok to let 
>the database have its opinions and let third party tools fill in the gaps.
>

I can appreciate the desire to stay in scope.  I believe usability is king.  
When users have to learn the database, then learn what they have to automate, 
then learn an automation tool, and then use the automation tool to do something 
as fundamental as the tasks I described, then something is missing from the 
database itself that is adversely affecting usability - and that is very bad.  
Where those big companies need to calculate the ROI is in the cost of acquiring 
or training the next group of users.  Consider how steep the learning curve is 
for new users.  Consider the business case for improving ease of use.

>2) Cassandra is, by definition, a database for large scale problems. Most of 
>the companies working on/with it tend to be big companies. Big companies often 
>have pre-existing automation that solved the stuff you consider fundamental 
>tasks, so there’s probably nobody actively working on the solved problems that 
>you may consider missing features - for many people they’re already solved.
>

I could be wrong but it sounds like a lot of the code work is done, and if the 
companies would take the time to contribute more code, then the rest of the 
code needed could be generated easily.

>3) It’s not nearly as basic as you think it is. Datastax seemingly had a 
>multi-person team on opscenter, and while it was better than anything else 
>around last time I used it (before it stopped supporting the OSS version), it 
>left a lot to be desired. It’s probably 2-3 engineers working for a month  to 
>have any sort of meaningful, reliable, mostly trivial cluster-managing UI, and 
>I can think of about 10 JIRAs I’d rather see that time be spent on first.

How about 6-9 engineers working 12 months a year on it then?  I'm not kidding.  
For a big company with revenues in the tens of billions or more, and a heavy 
use of Cassandra nodes, it's easy to make a case for having a full-time person 
or more that involved.  They aren't paying for using the open source code that 
is Cassandra.  What would the licensing fees be for a big company if the costs 
were like what Microsoft or Oracle would charge for their enterprise-level 
relational database?  What's the contribution of one or two people in 
comparison?

>> Ease of use issues 

Re: Cassandra Needs to Grow Up by Version Five!

2018-02-19 Thread Jeff Jirsa
There are a lot of things below I disagree with, but it's ok. I convinced
myself not to nit-pick every point.

https://issues.apache.org/jira/browse/CASSANDRA-13971 has some of Stefan's
work with cert management

Beyond that, I encourage you to do what Michael suggested: open JIRAs for
things you care strongly about, work on them if you have time. Sometime
this year we'll schedule an NGCC (Next Generation Cassandra Conference)
where we talk about future project work and direction. I encourage you to
attend if you're able (I encourage anyone who cares about the direction of
Cassandra to attend; it'll probably be either free or very low cost, just to
cover a venue and some food). If nothing else, you'll meet some of the
teams who are working on the project, and learn why they've selected the
projects on which they're working. You'll have an opportunity to pitch your
vision, and maybe you can talk some folks into helping out.

- Jeff




On Mon, Feb 19, 2018 at 1:01 AM, Kenneth Brotman <
kenbrot...@yahoo.com.invalid> wrote:

> Comments inline
>
> >-Original Message-
> >From: Jeff Jirsa [mailto:jji...@gmail.com]
> >Sent: Sunday, February 18, 2018 10:58 PM
> >To: user@cassandra.apache.org
> >Cc: d...@cassandra.apache.org
> >Subject: Re: Cassandra Needs to Grow Up by Version Five!
> >
> >Comments inline
> >
> >
> >> On Feb 18, 2018, at 9:39 PM, Kenneth Brotman
>  wrote:
> >>
> > >Cassandra feels like an unfinished program to me. The problem is not
> that it’s open source or cutting edge.  It’s an open source cutting edge
> program that lacks some of its basic functionality.  We are all stuck
> addressing fundamental mechanical tasks for Cassandra because the basic
> code that would do that part has not been contributed yet.
> >>
> >There’s probably 2-3 reasons why here:
> >
> >1) Historically the pmc has tried to keep the scope of the project very
> narrow. It’s a database. We don’t ship drivers. We don’t ship developer
> tools. We don’t ship fancy UIs. We ship a database. I think for the most
> part the narrow vision has been for the best, but maybe it’s time to
> reconsider some of the scope.
> >
> >Postgres will autovacuum to prevent wraparound (hopefully),  but everyone
> I know running Postgres uses flexible-freeze in cron - sometimes it’s ok to
> let the database have its opinions and let third party tools fill in the
> gaps.
> >
>
> I can appreciate the desire to stay in scope, but I believe usability is
> king.  When users have to learn the database, then learn what they have to
> automate, then learn an automation tool, and then use that tool to do
> something as fundamental as the tasks I described, something is missing from
> the database itself that is adversely affecting usability - and that is very
> bad.  Where those big companies need to calculate the ROI is in the cost of
> acquiring or training the next group of users.  Consider how steep the
> learning curve is for new users.
> Consider the business case for improving ease of use.
>
> >2) Cassandra is, by definition, a database for large scale problems. Most
> of the companies working on/with it tend to be big companies. Big companies
> often have pre-existing automation that solved the stuff you consider
> fundamental tasks, so there’s probably nobody actively working on the
> solved problems that you may consider missing features - for many people
> they’re already solved.
> >
>
> I could be wrong, but it sounds like a lot of the code work is already done;
> if the companies would take the time to contribute more of that code, the
> rest of the code needed could be produced easily.
>
> >3) It’s not nearly as basic as you think it is. Datastax seemingly had a
> multi-person team on opscenter, and while it was better than anything else
> around last time I used it (before it stopped supporting the OSS version),
> it left a lot to be desired. It’s probably 2-3 engineers working for a
> month  to have any sort of meaningful, reliable, mostly trivial
> cluster-managing UI, and I can think of about 10 JIRAs I’d rather see that
> time be spent on first.
>
> How about 6-9 engineers working 12 months a year on it, then?  I'm not
> kidding.  For a big company with revenues in the tens of billions or more,
> and heavy use of Cassandra nodes, it's easy to make a case for having one or
> more full-time people that involved.  They aren't paying to use the open
> source code that is Cassandra.  Consider what the licensing fees would be
> for a big company if the costs were what Microsoft or Oracle would charge
> for their enterprise-level relational databases.  What's the contribution of
> one or two people in comparison?
>
> >> Ease of use issues need to be given much more attention.  For an
> administrator, the ease of use of Cassandra is very poor.
> >>
> >>Furthermore, currently Cassandra is an idiot.  We have to do everything
> for Cassandra. Contrast that with the fact that we are in 

Re: Cassandra counter readtimeout error

2018-02-19 Thread Alain RODRIGUEZ
Hi Javier,

Glad to hear it is solved now. Cassandra 3.11.1 should be a more stable
version and 3.11 a better series.

Excuse my misunderstanding; your table seems to be better designed than I
thought.

Welcome to the Apache Cassandra community!

C*heers ;-)
---
Alain Rodriguez - @arodream - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com



2018-02-19 9:31 GMT+00:00 Javier Pareja :

> Hi,
>
> Thank you for your reply.
>
> As I was bothered by this problem, last night I upgraded the cluster to
> version 3.11.1 and everything is working now. As far as I can tell the
> counter table can be read now. I will be doing more testing today with this
> version but it is looking good.
>
> To answer your questions:
> - I might not have explained the table definition very well but the table
> does not have 6 partitions, but 6 partition keys. There are thousands of
> partitions in that table, a combination of all those partition keys. I also
> made sure that the partitions remained small when designing the table.
> - I also enabled tracing in the CQLSH but it showed nothing when querying
> this row. It however did when querying other tables...
>
> Thanks again for your reply!! I am very excited to be part of the
> Cassandra user base.
>
> Javier
>
>
>
> F Javier Pareja
>
> On Mon, Feb 19, 2018 at 8:08 AM, Alain RODRIGUEZ 
> wrote:
>
>>
>> Hello,
>>
>> This table has 6 partition keys, 4 primary keys and 5 counters.
>>
>>
>> I think the root issue is this ^. There might be some inefficiency or
>> issues with counters, but this design makes Cassandra relatively
>> inefficient in most cases, whether using standard columns or counters.
>>
>> Cassandra data is supposed to be well distributed for maximal
>> efficiency. With only 6 partitions, if you have 6+ nodes, the load is
>> guaranteed to be imbalanced. If you have fewer nodes, it's still probably
>> poorly balanced. Ideally, Cassandra reads from a small number of SSTables,
>> in parallel across many nodes, to split the work and make queries
>> efficient; but in this case Cassandra is most probably reading huge
>> partitions from one node. When the size of the request is too big it can
>> time out. I am not sure how pagination works with counters, but I believe
>> even if pagination is working, at some point you are just reading too much
>> (or too inefficiently) and the timeout is reached.
>>
>> I imagine it worked well for a while, as counters are very small columns /
>> tables compared to any event data, but at some point you might have reached
>> a 'physical' limit, because you are pulling *all* the information you need
>> from one partition (and probably many SSTables).
>>
>> Is there really no other way to design this use case?
>>
>> When data starts to be inserted, I can query the counters correctly from
>>> that particular row but after a few minutes updating the table with
>>> thousands of events, I get a read timeout every time
>>>
>>
>> Troubleshooting:
>> - Use tracing to understand what takes so long with your queries.
>> - Check for warnings / errors in the logs. Cassandra tends to complain
>> when it is unhappy with its configuration. The logs hold a lot of
>> interesting information, and it's been a while since I last saw a failure
>> with no relevant information in the logs.
>> - Check SSTables per read and other read performance metrics for this
>> counter table. Some monitoring could make the reason for this timeout
>> obvious. If you use Datadog, for example, I guess a quick look at the "Read
>> Path" dashboard would help. If you are using any other tool, look for
>> SSTables per read, tombstones scanned (if any), key cache hit rate, and
>> resources (as fast insert rates, compactions, and implicit
>> 'read-before-writes' may be making the machine less responsive).
>>
>> Fix:
>> - Improve design to improve the findings you made above ^
>> - Improve compaction strategy or read operations depending on the
>> findings above ^
>>
>> I am not saying there is no bug in counters or in your version, but I
>> would say it is too early to state that; given the data model, other
>> reasons could explain this slowness.
>>
>> If you don't have any monitoring in place, tracing and logs are a nice
>> place to start digging. If you want to share those here, we can help
>> interpreting outputs you will share if needed :).
>>
>> C*heers,
>>
>> Alain
>>
>>
>> 2018-02-17 11:40 GMT+00:00 Javier Pareja :
>>
>>> Hello everyone,
>>>
>>> I get a timeout error when reading a particular row from a large
>>> counters table.
>>>
>>> I have a storm topology that inserts data into a Cassandra counter
>>> table. This table has 6 partition keys, 4 primary keys and 5 counters.
>>>
>>> When data starts to be inserted, I can query the counters correctly from
>>> that particular row but after a few minutes updating the table with
>>> thousands of 

Re: newbie , to use cassandra when query is arbitrary?

2018-02-19 Thread Rajesh Kishore
It can be a minimum of 20 million up to 10 billion entries,

with each entry containing up to 100 columns.

Rajesh

On 19 Feb 2018 9:02 p.m., "Rahul Singh" 
wrote:

How much data do you need to store, and what is the frequency of reads and
writes?

--
Rahul Singh
rahul.si...@anant.us

Anant Corporation

On Feb 19, 2018, 3:44 AM -0500, Rajesh Kishore ,
wrote:

Hi All,

I am a newbie to the Cassandra world and have got some understanding of the
product. I have an application (which is a kind of datastore) for other
applications; the user queries are not fixed, i.e. the queries can come with
any attributes. In this case, is it recommended to use Cassandra? What
benefits can we get?

Background - The application currently uses Berkeley DB for maintaining
entries; we are trying to evaluate whether another backend can fit the
requirements we have.

Now, if we want to use Cassandra, I broadly see one table which would
contain all the entries. The question is: what would the correct
partition keys be?
entity is
Entry {
id varchar,
objectclasses list
sn
cn
...
...
}

and query can be anything like
a) get all entries based on sn=*
b) get all entries based on sn=A and cn=b
c) get all entries based on sn=A OR objectclass contains person
..


Please advise.

Thanks,
Rajesh


Re: newbie , to use cassandra when query is arbitrary?

2018-02-19 Thread Rahul Singh
How much data do you need to store, and what is the frequency of reads and 
writes?

--
Rahul Singh
rahul.si...@anant.us

Anant Corporation

On Feb 19, 2018, 3:44 AM -0500, Rajesh Kishore , wrote:
> Hi All,
>
> I am a newbie to the Cassandra world and have got some understanding of the
> product. I have an application (which is a kind of datastore) for other 
> applications; the user queries are not fixed, i.e. the queries can come with 
> any attributes. In this case, is it recommended to use Cassandra? What benefits can we get?
>
> Background - The application currently uses Berkeley DB for maintaining 
> entries; we are trying to evaluate whether another backend can fit the 
> requirements we have.
>
> Now, if we want to use Cassandra, I broadly see one table which would 
> contain all the entries. The question is: what would the correct 
> partition keys be?
> entity is
> Entry {
> id varchar,
> objectclasses list
> sn
> cn
> ...
> ...
> }
>
> and query can be anything like
> a) get all entries based on sn=*
> b) get all entries based on sn=A and cn=b
> c) get all entries based on sn=A OR objectclass contains person
> ..
> 
>
> Please advise.
>
> Thanks,
> Rajesh


Re: Cassandra cluster: could not reach linear scalability

2018-02-19 Thread onmstester onmstester
Thanks for your help,

I'd been focused on the Cassandra servers and had forgotten about the client completely!


Sent using Zoho Mail

 On Mon, 19 Feb 2018 15:21:03 +0330 Lucas Benevides 
lu...@maurobenevides.com.br wrote 




Why did you set the number of threads to 1000?
Does it perform better than threads = auto?

I have used the stress tool in a larger test bed (10 nodes) and my optimal 
setup was 24 threads.
To check this you must monitor the stress node, both the CPU and I/O. And give 
it a try with fewer threads.

Lucas Benevides

Ipea


2018-02-18 8:29 GMT-03:00 onmstester onmstester onmstes...@zoho.com:

I've configured a simple cluster using two PCs with identical specs:

  CPU: Core i5
  RAM: 8GB DDR3
  Disk: 1TB 5400rpm
  Network: 1 G (I've tested it with iperf, it really is!)

using the common configs described on many sites, including DataStax itself:

cluster_name: 'MyCassandraCluster'
num_tokens: 256
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
         - seeds: "192.168.1.1,192.168.1.2"
listen_address:
rpc_address: 0.0.0.0
endpoint_snitch: GossipingPropertyFileSnitch

Running the stress tool:

cassandra-stress write n=100 -rate threads=1000 -mode native cql3 -node 192.168.1.1,192.168.1.2

Against each node individually it shows 39K writes/second, but running the same 
stress command against the cluster of both nodes shows 45K writes/second. I've 
done all the tuning mentioned by Apache and DataStax. There are many use cases 
on the net proving Cassandra's linear scalability, so what is wrong with my 
cluster?
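Taking the figures in this question at face value, the scaling efficiency is easy to quantify; here is a small sketch (hypothetical helper function, just illustrating the arithmetic):

```python
def scaling_efficiency(per_node_rate, n_nodes, cluster_rate):
    """Fraction of ideal linear scaling actually achieved."""
    ideal = per_node_rate * n_nodes  # perfect linear scaling
    return cluster_rate / ideal

# 39K writes/s against each node alone, 45K writes/s against both:
eff = scaling_efficiency(39_000, 2, 45_000)
print(f"{eff:.0%}")  # prints "58%"
```

An efficiency this far below 100% usually points at a bottleneck outside the cluster, e.g. a single stress client saturating first, which matches the conclusion reached later in this thread.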



Sent using Zoho Mail

Re: Cassandra cluster: could not reach linear scalability

2018-02-19 Thread Lucas Benevides
Why did you set the number of threads to 1000?
Does it perform better than threads = auto?

I have used the stress tool in a larger test bed (10 nodes) and my optimal
setup was 24 threads.
To check this you must monitor the stress node, both the CPU and I/O. And
give it a try with fewer threads.

Lucas Benevides
Ipea

2018-02-18 8:29 GMT-03:00 onmstester onmstester :

> I've configured a simple cluster using two PCs with identical specs:
>
>   cpu core i5
>RAM: 8GB ddr3
>Disk: 1TB 5400rpm
>Network: 1 G (I've tested it with iperf, it really is!)
>
> using the common configs described in many sites including datastax itself:
>
> cluster_name: 'MyCassandraCluster'
> num_tokens: 256
> seed_provider:
>   - class_name: org.apache.cassandra.locator.SimpleSeedProvider
> parameters:
>  - seeds: "192.168.1.1,192.168.1.2"
> listen_address:
> rpc_address: 0.0.0.0
> endpoint_snitch: GossipingPropertyFileSnitch
>
> Running stress tool:
>
> cassandra-stress write n=100 -rate threads=1000 -mode native cql3 -node 
> 192.168.1.1,192.168.1.2
>
> Against each node individually it shows 39K writes/second, but running the
> same stress command against the cluster of both nodes shows 45K
> writes/second. I've done all the tuning mentioned by Apache and DataStax.
> There are many use cases on the net proving Cassandra's linear scalability,
> so what is wrong with my cluster?
>
> Sent using Zoho Mail 
>
>
>


Re: Cassandra counter readtimeout error

2018-02-19 Thread Javier Pareja
Hi,

Thank you for your reply.

As I was bothered by this problem, last night I upgraded the cluster to
version 3.11.1 and everything is working now. As far as I can tell the
counter table can be read now. I will be doing more testing today with this
version but it is looking good.

To answer your questions:
- I might not have explained the table definition very well but the table
does not have 6 partitions, but 6 partition keys. There are thousands of
partitions in that table, a combination of all those partition keys. I also
made sure that the partitions remained small when designing the table.
- I also enabled tracing in the CQLSH but it showed nothing when querying
this row. It however did when querying other tables...

Thanks again for your reply!! I am very excited to be part of the Cassandra
user base.

Javier



F Javier Pareja

On Mon, Feb 19, 2018 at 8:08 AM, Alain RODRIGUEZ  wrote:

>
> Hello,
>
> This table has 6 partition keys, 4 primary keys and 5 counters.
>
>
> I think the root issue is this ^. There might be some inefficiency or
> issues with counters, but this design makes Cassandra relatively
> inefficient in most cases, whether using standard columns or counters.
>
> Cassandra data is supposed to be well distributed for maximal efficiency.
> With only 6 partitions, if you have 6+ nodes, the load is guaranteed to be
> imbalanced. If you have fewer nodes, it's still probably poorly balanced.
> Ideally, Cassandra reads from a small number of SSTables, in parallel
> across many nodes, to split the work and make queries efficient; but in
> this case Cassandra is most probably reading huge partitions from one node.
> When the size of the request is too big it can time out. I am not sure how
> pagination works with counters, but I believe even if pagination is
> working, at some point you are just reading too much (or too
> inefficiently) and the timeout is reached.
>
> I imagine it worked well for a while, as counters are very small columns /
> tables compared to any event data, but at some point you might have reached
> a 'physical' limit, because you are pulling *all* the information you need
> from one partition (and probably many SSTables).
>
> Is there really no other way to design this use case?
>
> When data starts to be inserted, I can query the counters correctly from
>> that particular row but after a few minutes updating the table with
>> thousands of events, I get a read timeout every time
>>
>
> Troubleshooting:
> - Use tracing to understand what takes so long with your queries.
> - Check for warnings / errors in the logs. Cassandra tends to complain
> when it is unhappy with its configuration. The logs hold a lot of
> interesting information, and it's been a while since I last saw a failure
> with no relevant information in the logs.
> - Check SSTables per read and other read performance metrics for this
> counter table. Some monitoring could make the reason for this timeout
> obvious. If you use Datadog, for example, I guess a quick look at the "Read
> Path" dashboard would help. If you are using any other tool, look for
> SSTables per read, tombstones scanned (if any), key cache hit rate, and
> resources (as fast insert rates, compactions, and implicit
> 'read-before-writes' may be making the machine less responsive).
>
> Fix:
> - Improve design to improve the findings you made above ^
> - Improve compaction strategy or read operations depending on the findings
> above ^
>
> I am not saying there is no bug in counters or in your version, but I
> would say it is too early to state that; given the data model, other
> reasons could explain this slowness.
>
> If you don't have any monitoring in place, tracing and logs are a nice
> place to start digging. If you want to share those here, we can help
> interpreting outputs you will share if needed :).
>
> C*heers,
>
> Alain
>
>
> 2018-02-17 11:40 GMT+00:00 Javier Pareja :
>
>> Hello everyone,
>>
>> I get a timeout error when reading a particular row from a large counters
>> table.
>>
>> I have a storm topology that inserts data into a Cassandra counter table.
>> This table has 6 partition keys, 4 primary keys and 5 counters.
>>
>> When data starts to be inserted, I can query the counters correctly from
>> that particular row but after a few minutes updating the table with
>> thousands of events, I get a readtimeout every time I try to read a
>> particular row from the table (the most frequently updated). Other rows I
>> can read quick and fine. Also if I run "select *", the top few hundreds are
>> returned quick and fine as expected. The storm topology is stopped but the
>> error is still there.
>>
>> I am using Cassandra 3.6.
>>
>> More information here:
>> https://stackoverflow.com/q/48833146
>>
>> Are counters in this version broken? I run the query from CQLSH and get
>> the same error every time. I tried running it with trace enabled and get
>> nothing but the error:
>>
>> ReadTimeout: Error from 

RE: Cassandra Needs to Grow Up by Version Five!

2018-02-19 Thread Kenneth Brotman
Comments inline

>-Original Message-
>From: Jeff Jirsa [mailto:jji...@gmail.com] 
>Sent: Sunday, February 18, 2018 10:58 PM
>To: user@cassandra.apache.org
>Cc: d...@cassandra.apache.org
>Subject: Re: Cassandra Needs to Grow Up by Version Five!
>
>Comments inline 
>
>
>> On Feb 18, 2018, at 9:39 PM, Kenneth Brotman  
>> wrote:
>>
> >Cassandra feels like an unfinished program to me. The problem is not that 
> >it’s open source or cutting edge.  It’s an open source cutting edge program 
> >that lacks some of its basic functionality.  We are all stuck addressing 
> >fundamental mechanical tasks for Cassandra because the basic code that would 
> >do that part has not been contributed yet.
>> 
>There’s probably 2-3 reasons why here:
>
>1) Historically the pmc has tried to keep the scope of the project very 
>narrow. It’s a database. We don’t ship drivers. We don’t ship developer tools. 
>We don’t ship fancy UIs. We ship a database. I think for the most part the 
>narrow vision has been for the best, but maybe it’s time to reconsider some of 
>the scope. 
>
>Postgres will autovacuum to prevent wraparound (hopefully),  but everyone I 
>know running Postgres uses flexible-freeze in cron - sometimes it’s ok to let 
>the database have its opinions and let third party tools fill in the gaps.
>

I can appreciate the desire to stay in scope, but I believe usability is king. 
When users have to learn the database, then learn what they have to automate, 
then learn an automation tool, and then use that tool to do something as 
fundamental as the tasks I described, something is missing from the database 
itself that is adversely affecting usability - and that is very bad.  Where 
those big companies need to calculate the ROI is in the cost of acquiring or 
training the next group of users.  Consider how steep the learning curve is for 
new users.  Consider the business case for improving ease of use. 

>2) Cassandra is, by definition, a database for large scale problems. Most of 
>the companies working on/with it tend to be big companies. Big companies often 
>have pre-existing automation that solved the stuff you consider fundamental 
>tasks, so there’s probably nobody actively working on the solved problems that 
>you may consider missing features - for many people they’re already solved.
>

I could be wrong, but it sounds like a lot of the code work is already done; if 
the companies would take the time to contribute more of that code, the rest of 
the code needed could be produced easily.

>3) It’s not nearly as basic as you think it is. Datastax seemingly had a 
>multi-person team on opscenter, and while it was better than anything else 
>around last time I used it (before it stopped supporting the OSS version), it 
>left a lot to be desired. It’s probably 2-3 engineers working for a month  to 
>have any sort of meaningful, reliable, mostly trivial cluster-managing UI, and 
>I can think of about 10 JIRAs I’d rather see that time be spent on first. 

How about 6-9 engineers working 12 months a year on it, then?  I'm not kidding.  
For a big company with revenues in the tens of billions or more, and heavy use 
of Cassandra nodes, it's easy to make a case for having one or more full-time 
people that involved.  They aren't paying to use the open source code that is 
Cassandra.  Consider what the licensing fees would be for a big company if the 
costs were what Microsoft or Oracle would charge for their enterprise-level 
relational databases.  What's the contribution of one or two people in 
comparison?

>> Ease of use issues need to be given much more attention.  For an 
>> administrator, the ease of use of Cassandra is very poor. 
>>
>>Furthermore, currently Cassandra is an idiot.  We have to do everything for 
>>Cassandra. Contrast that with the fact that we are in the dawn of artificial 
>>intelligence.
>> 
>
>And for everything you think is obvious, there's a 50% chance someone else 
>will have already solved it differently, and your obvious new solution will be 
>seen as an inconvenient assumption and complexity they won’t appreciate. Open 
>source projects get to walk a fine line of trying to be useful without making 
>too many assumptions, being “too” opinionated, or overstepping bounds. We may 
>be too conservative, but it’s very easy to go too far in the opposite 
>direction. 
>

I appreciate that, but when such concerns result in inaction instead of 
resolution, that is no good.

>> Software exists to automate tasks for humans, not mechanize humans to 
>> administer tasks for a database.  I’m an engineering type.  My job is to 
>> apply science and technology to solve real world problems.  And that’s where 
>> I need an organization’s I.T. talent to focus; not in crank starting an 
>> unfinished database.
>> 
>
>And that’s why nobody’s done it - we all have bigger problems we’re being paid 
>to solve, and nobody’s felt it necessary. Because 

newbie , to use cassandra when query is arbitrary?

2018-02-19 Thread Rajesh Kishore
Hi All,

I am a newbie to the Cassandra world and have got some understanding of the
product. I have an application (which is a kind of datastore) for other
applications; the user queries are not fixed, i.e. the queries can come with
any attributes. In this case, is it recommended to use Cassandra? What
benefits can we get?

Background - The application currently uses Berkeley DB for maintaining
entries; we are trying to evaluate whether another backend can fit the
requirements we have.

Now, if we want to use Cassandra, I broadly see one table which would
contain all the entries. The question is: what would the correct
partition keys be?
entity is
Entry {
id varchar,
objectclasses list
sn
cn
...
...
}

and query can be anything like
a) get all entries based on sn=*
b) get all entries based on sn=A and cn=b
c) get all entries based on sn=A OR objectclass contains person
..


Please advise.

Thanks,
Rajesh
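To illustrate why arbitrary-attribute queries sit awkwardly on a partition-keyed store, here is a minimal sketch (hypothetical in-memory data, not driver code): a lookup by key reads one bucket, while an arbitrary predicate has to visit every entry, which is exactly the full-scan pattern Cassandra's data model is designed to avoid.

```python
# Entries bucketed by partition key (id), mimicking how Cassandra
# routes a keyed query straight to one partition.
entries_by_id = {
    "e1": {"sn": "A", "cn": "b", "objectclasses": ["person"]},
    "e2": {"sn": "A", "cn": "c", "objectclasses": ["device"]},
    "e3": {"sn": "B", "cn": "b", "objectclasses": ["person"]},
}

def get_by_key(entry_id):
    # Key lookup: one partition read, cheap at any scale.
    return entries_by_id.get(entry_id)

def scan(predicate):
    # Arbitrary predicate: must visit every entry (a full scan).
    return [e for e in entries_by_id.values() if predicate(e)]

print(get_by_key("e1")["sn"])                                   # prints "A"
print(len(scan(lambda e: "person" in e["objectclasses"])))      # prints "2"
```

With billions of entries, only the `get_by_key` path stays fast; every query shape you cannot express as a key lookup either needs its own denormalized table or a search-oriented system alongside Cassandra.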


Re: Help needed to enable Client-to-node encryption (SSL)

2018-02-19 Thread Alain RODRIGUEZ
>
>  (2.0 is getting pretty old and isn't supported, you may want to consider
> upgrading; 2.1 would be the smallest change and least risk, but it, too, is
> near end of life)


I would upgrade as well. I think moving from Cassandra 2.0 to Cassandra 2.2
directly is doable smoothly and preferable (though it still deserves to be
tested for each environment). It might be the better move given that, as
Jeff said, Cassandra 2.1 support is already very limited and will soon be
stopped.

Here is the difference:

> - Apache Cassandra 2.2 is supported until *4.0 release (date TBD)*.
>   The latest release is 2.2.12 (pgp, md5, and sha1), released on 2018-02-16.
> - Apache Cassandra 2.1 is supported until *4.0 release (date TBD)* with
>   *critical fixes only*. The latest release is 2.1.20 (pgp, md5, and sha1),
>   released on 2018-02-16.
>
It's not a huge difference, but we have been doing upgrades from 2.0 and
2.1 to 2.2 directly, if I remember correctly. I would say you have the
option to skip a major version this time; I wanted to share this in case it
might be worth it for you.

C*heers,
---
Alain Rodriguez - @arodream - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2018-02-16 16:34 GMT+00:00 Jeff Jirsa :

> http://thelastpickle.com/blog/2015/09/30/hardening-
> cassandra-step-by-step-part-1-server-to-server.html
>
> https://www.youtube.com/watch?v=CKt0XVPogf4
>
> (2.0 is getting pretty old and isn't supported, you may want to consider
> upgrading; 2.1 would be the smallest change and least risk, but it, too, is
> near end of life)
>
>
>
> On Fri, Feb 16, 2018 at 8:05 AM, Prachi Rath 
> wrote:
>
>> Hi,
>>
>> I am using cassandra version  2.0 . My goal is to do cassandra client to
>> node security using SSL with my self-signed CA.
>>
>> What would be the recommended procedure for enabling SSL on  cassandra
>> version 2.0.17 .
>>
>> Thanks,
>> Prachi
>>
>
>
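For reference, client-to-node encryption is configured via the `client_encryption_options` block in cassandra.yaml; a minimal sketch is below. The paths and passwords are placeholders, and the exact option names should be double-checked against the yaml shipped with your 2.0.x install.

```yaml
# cassandra.yaml (sketch; hypothetical paths and passwords)
client_encryption_options:
  enabled: true
  keystore: /etc/cassandra/conf/keystore.jks        # node key pair, signed by your CA
  keystore_password: changeit
  truststore: /etc/cassandra/conf/truststore.jks    # contains your self-signed CA cert
  truststore_password: changeit
  require_client_auth: false
```

Clients then need the CA certificate in their own truststore to validate the node certificates; the hardening guide linked above walks through generating and importing the keys.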


Re: Cassandra counter readtimeout error

2018-02-19 Thread Alain RODRIGUEZ
Hello,

This table has 6 partition keys, 4 primary keys and 5 counters.


I think the root issue is this ^. There might be some inefficiency or
issues with counters, but this design makes Cassandra relatively
inefficient in most cases, whether using standard columns or counters.

Cassandra data is supposed to be well distributed for maximal efficiency.
With only 6 partitions, if you have 6+ nodes, the load is guaranteed to be
imbalanced. If you have fewer nodes, it's still probably poorly balanced.
Ideally, Cassandra reads from a small number of SSTables, in parallel across
many nodes, to split the work and make queries efficient; but in this case
Cassandra is most probably reading huge partitions from one node. When the
size of the request is too big it can time out. I am not sure how pagination
works with counters, but I believe even if pagination is working, at some
point you are just reading too much (or too inefficiently) and the timeout
is reached.

I imagine it worked well for a while, as counters are very small columns /
tables compared to any event data, but at some point you might have reached
a 'physical' limit, because you are pulling *all* the information you need
from one partition (and probably many SSTables).

Is there really no other way to design this use case?

When data starts to be inserted, I can query the counters correctly from
> that particular row but after a few minutes updating the table with
> thousands of events, I get a read timeout every time
>

Troubleshooting:
- Use tracing to understand what takes so long with your queries.
- Check for warnings / errors in the logs. Cassandra tends to complain when
it is unhappy with its configuration. The logs hold a lot of interesting
information, and it's been a while since I last saw a failure with no
relevant information in the logs.
- Check SSTables per read and other read performance metrics for this
counter table. Some monitoring could make the reason for this timeout
obvious. If you use Datadog, for example, I guess a quick look at the "Read
Path" dashboard would help. If you are using any other tool, look for
SSTables per read, tombstones scanned (if any), key cache hit rate, and
resources (as fast insert rates, compactions, and implicit
'read-before-writes' may be making the machine less responsive).

Fix:
- Improve design to improve the findings you made above ^
- Improve compaction strategy or read operations depending on the findings
above ^
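The hot-partition explanation above is easy to check offline as well; here is a toy sketch (hypothetical write log, not driver code) of spotting the partition that absorbs most updates:

```python
from collections import Counter

# Hypothetical stream of writes, each tagged with its partition key.
writes = ["p1", "p2", "p1", "p1", "p3", "p1", "p1", "p2", "p1"]

by_partition = Counter(writes)
hot_key, hot_count = by_partition.most_common(1)[0]

# p1 absorbs 6 of 9 writes: the frequently updated row that times out
# on read is very likely this kind of partition.
print(hot_key, hot_count)  # prints "p1 6"
```

In practice the same signal comes from `nodetool toppartitions` or from tallying partition keys in the application's own write path.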

I am not saying there is no bug in counters or in your version, but I would
say it is too early to state that; given the data model, other reasons could
explain this slowness.

If you don't have any monitoring in place, tracing and logs are a nice
place to start digging. If you want to share those here, we can help
interpreting outputs you will share if needed :).

C*heers,

Alain


2018-02-17 11:40 GMT+00:00 Javier Pareja :

> Hello everyone,
>
> I get a timeout error when reading a particular row from a large counters
> table.
>
> I have a storm topology that inserts data into a Cassandra counter table.
> This table has 6 partition keys, 4 primary keys and 5 counters.
>
> When data starts to be inserted, I can query the counters correctly from
> that particular row but after a few minutes updating the table with
> thousands of events, I get a readtimeout every time I try to read a
> particular row from the table (the most frequently updated). Other rows I
> can read quick and fine. Also if I run "select *", the top few hundreds are
> returned quick and fine as expected. The storm topology is stopped but the
> error is still there.
>
> I am using Cassandra 3.6.
>
> More information here:
> https://stackoverflow.com/q/48833146
>
> Are counters in this version broken? I run the query from CQLSH and get
> the same error every time. I tried running it with trace enabled and get
> nothing but the error:
>
> ReadTimeout: Error from server: code=1200 [Coordinator node timed out waiting 
> for replica nodes' responses] message="Operation timed out - received only 0 
> responses." info={'received_responses': 0, 'required_responses': 1, 
> 'consistency': 'ONE'}
>
>
> Any ideas?
>