Cassandra Resource Planning

2016-03-31 Thread Joe Hicks
I am doing resource planning and could use some help. How many operations
people will I need to manage my Cassandra implementation for two sites with
10 nodes at each site? As my cluster grows, at what point will I need to
add another person?


Re: Multi DC setup for analytics

2016-03-31 Thread Anishek Agarwal
Hey Bryan,

Thanks for the info; we inferred as much. The only other thing we were
trying was to start two separate instances in the Analytics cluster on the
same set of machines to talk to the respective individual DCs, but within
2 mins we dropped that, as we would have to change ports on at least one of
the existing DCs so that when they join the analytics cluster they are on
the same port.

for now we are just getting another set of machines for this.


I had known about the pattern of using a separate analytics cluster for
Cassandra but thought we could join them across two clusters. My bad; now
that I think of it, it would have been better to have just one DC for
realtime prod requests instead of two.

Are there ways of merging existing clusters into one cluster in Cassandra?


On Fri, Apr 1, 2016 at 5:05 AM, Bryan Cheng  wrote:

> I'm jumping into this thread late, so sorry if this has been covered
> before. But am I correct in reading that you have two different Cassandra
> rings, not talking to each other at all, and you want to have a shared DC
> with a third Cassandra ring?
>
> I'm not sure what you want to do is possible.
>
> If I had the luxury of starting from scratch, the design I would do is:
> All three DC's in one cluster, with 3 datacenters. DC3 is the analytics DC.
> DC1's keyspaces are replicated to DC1 and DC3 only.
> DC2's keyspaces are replicated to DC2 and DC3 only.
>
> Then you have DC3 with all data from both DC1 and DC2 to run analytics on,
> and no cross-talk between DC1 and DC2.
>
> If you cannot rebuild your existing clusters, you may want to consider
> using something like Spark to ETL your data out of DC1 and DC2 into a new
> cluster at DC3. At that point you're running a data warehouse and lose some
> of the advantages of seamless cluster membership.
>
> On Wed, Mar 30, 2016 at 5:43 AM, Anishek Agarwal 
> wrote:
>
>> Hey Guys,
>>
>> We did the necessary changes and were trying to get this back on track,
>> but hit another wall,
>>
>> we have two clusters in different DCs (DC1 and DC2) with cluster names
>> (CLUSTER_1, CLUSTER_2).
>>
>> we want to have a common analytics cluster in DC3 with cluster name
>> (CLUSTER_3) -- looks like this can't be done, so we have to set up two
>> different analytics clusters? Can't we just get data from CLUSTER_1/2 to
>> the same cluster CLUSTER_3?
>>
>> thanks
>> anishek
>>
>> On Mon, Mar 21, 2016 at 3:31 PM, Anishek Agarwal 
>> wrote:
>>
>>> Hey Clint,
>>>
>>> we have two separate rings which don't talk to each other but both
>>> having the same DC name "DCX".
>>>
>>> @Raja,
>>>
>>> We had already gone towards the path you suggested.
>>>
>>> thanks all
>>> anishek
>>>
>>> On Fri, Mar 18, 2016 at 8:01 AM, Reddy Raja 
>>> wrote:
>>>
 Yes. Here are the steps.
 You will have to change the DC Names first.
 DC1 and DC2 would be independent clusters.

 Create a new DC, DC3, and include these two DCs in DC3.

 This should work well.


 On Thu, Mar 17, 2016 at 11:03 PM, Clint Martin <
 clintlmar...@coolfiretechnologies.com> wrote:

> When you say you have two logical DC both with the same name are you
> saying that you have two clusters of servers both with the same DC name,
> neither of which currently talk to each other? I.e., they are two separate
> rings?
>
> Or do you mean that you have two keyspaces in one cluster?
>
> Or?
>
> Clint
> On Mar 14, 2016 2:11 AM, "Anishek Agarwal"  wrote:
>
>> Hello,
>>
>> We are using cassandra 2.0.17 and have two logical DC having
>> different Keyspaces but both having same logical name DC1.
>>
>> we want to setup another cassandra cluster for analytics which should
>> get data from both the above DC.
>>
>> if we setup the new DC with name DC2 and follow the steps
>> https://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_add_dc_to_cluster_t.html
>> will it work ?
>>
>> I would think we would have to first change the names of the existing
>> clusters to have two different names and then go with adding another dc
>> getting data from these?
>>
>> Also, as soon as we add the node the data starts moving... this will
>> all be only the real-time changes done to the cluster, right? We still
>> have to do the rebuild to get the data for the tokens for the node in the
>> new cluster?
>>
>> Thanks
>> Anishek
>>
>


 --
 "In this world, you either have an excuse or a story. I preferred to
 have a story"

>>>
>>>
>>
>


Re: Multi DC setup for analytics

2016-03-31 Thread Bryan Cheng
I'm jumping into this thread late, so sorry if this has been covered
before. But am I correct in reading that you have two different Cassandra
rings, not talking to each other at all, and you want to have a shared DC
with a third Cassandra ring?

I'm not sure what you want to do is possible.

If I had the luxury of starting from scratch, the design I would do is:
All three DC's in one cluster, with 3 datacenters. DC3 is the analytics DC.
DC1's keyspaces are replicated to DC1 and DC3 only.
DC2's keyspaces are replicated to DC2 and DC3 only.

Then you have DC3 with all data from both DC1 and DC2 to run analytics on,
and no cross-talk between DC1 and DC2.
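
The replication layout described above can be expressed per keyspace with
NetworkTopologyStrategy. A sketch, with illustrative keyspace names and an
assumed RF of 3 per datacenter:

```sql
-- DC1's data lives in DC1 and the analytics DC (DC3) only
CREATE KEYSPACE ks_dc1
  WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC3': 3};

-- DC2's data lives in DC2 and the analytics DC (DC3) only
CREATE KEYSPACE ks_dc2
  WITH replication = {'class': 'NetworkTopologyStrategy', 'DC2': 3, 'DC3': 3};
```

Because neither keyspace lists the other realtime DC, DC1 and DC2 never
exchange that data, while DC3 holds a full copy of both for analytics.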

If you cannot rebuild your existing clusters, you may want to consider
using something like Spark to ETL your data out of DC1 and DC2 into a new
cluster at DC3. At that point you're running a data warehouse and lose some
of the advantages of seamless cluster membership.

On Wed, Mar 30, 2016 at 5:43 AM, Anishek Agarwal  wrote:

> Hey Guys,
>
> We did the necessary changes and were trying to get this back on track,
> but hit another wall,
>
> we have two clusters in different DCs (DC1 and DC2) with cluster names
> (CLUSTER_1, CLUSTER_2).
>
> we want to have a common analytics cluster in DC3 with cluster name
> (CLUSTER_3) -- looks like this can't be done, so we have to set up two
> different analytics clusters? Can't we just get data from CLUSTER_1/2 to
> the same cluster CLUSTER_3?
>
> thanks
> anishek
>
> On Mon, Mar 21, 2016 at 3:31 PM, Anishek Agarwal 
> wrote:
>
>> Hey Clint,
>>
>> we have two separate rings which don't talk to each other but both having
>> the same DC name "DCX".
>>
>> @Raja,
>>
>> We had already gone towards the path you suggested.
>>
>> thanks all
>> anishek
>>
>> On Fri, Mar 18, 2016 at 8:01 AM, Reddy Raja  wrote:
>>
>>> Yes. Here are the steps.
>>> You will have to change the DC Names first.
>>> DC1 and DC2 would be independent clusters.
>>>
>>> Create a new DC, DC3, and include these two DCs in DC3.
>>>
>>> This should work well.
>>>
>>>
>>> On Thu, Mar 17, 2016 at 11:03 PM, Clint Martin <
>>> clintlmar...@coolfiretechnologies.com> wrote:
>>>
 When you say you have two logical DC both with the same name are you
 saying that you have two clusters of servers both with the same DC name,
 neither of which currently talk to each other? I.e., they are two separate
 rings?

 Or do you mean that you have two keyspaces in one cluster?

 Or?

 Clint
 On Mar 14, 2016 2:11 AM, "Anishek Agarwal"  wrote:

> Hello,
>
> We are using cassandra 2.0.17 and have two logical DC having different
> Keyspaces but both having same logical name DC1.
>
> we want to setup another cassandra cluster for analytics which should
> get data from both the above DC.
>
> if we setup the new DC with name DC2 and follow the steps
> https://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_add_dc_to_cluster_t.html
> will it work ?
>
> I would think we would have to first change the names of the existing
> clusters to have two different names and then go with adding another dc
> getting data from these?
>
> Also, as soon as we add the node the data starts moving... this will
> all be only the real-time changes done to the cluster, right? We still have to
> do the rebuild to get the data for the tokens for the node in the new cluster?
>
> Thanks
> Anishek
>

>>>
>>>
>>> --
>>> "In this world, you either have an excuse or a story. I preferred to
>>> have a story"
>>>
>>
>>
>


Re: Upgrade cassandra from 2.1.9 to 3.x?

2016-03-31 Thread adanec...@yahoo.com
Sure, I have asked that before. I was told to get it from DataStax, but why
would the install not have that?

Sent from my Verizon 4G LTE Smartphone

-- Original message --
From: John Wong
Date: Thu, Mar 31, 2016 3:20 PM
To: user@cassandra.apache.org; Tony Anecito
Cc:
Subject: Re: Upgrade cassandra from 2.1.9 to 3.x?
Even if you can upgrade Cassandra straight from 1.2 to 3.X, also consider 
driver compatibility.
On Thu, Mar 31, 2016 at 2:43 PM, Tony Anecito  wrote:
I would also like to know.
Thanks!
 

On Thursday, March 31, 2016 6:14 AM, Steven Choo  wrote:
  

 Hi,

Is it possible to update Cassandra from 2.1.9 to 3.x (e.g. 3.4) in one step?

The info about the tick-tock release schedule does not say anything specific
about it, and the documentation on http://docs.datastax.com is separated by
3.0 and 3.x. Even though the info on
https://github.com/apache/cassandra/blob/trunk/NEWS.txt is clear about updating
from 2.1.9 to 3.0 and does not say anything specifically about the tick-tock
versions, I want to be sure it's possible due to the earlier upgrade
requirements when upgrading between minor versions.

PS. I have seen
http://www.mail-archive.com/user@cassandra.apache.org/msg46632.html but I'm
more interested in whether it's possible at all.

Best regards,

Steven Choo

La VistaDuna
Bijsterveldenlaan 5
5045 ZZ Tilburg
Netherlands
www.iqnomy.com
Tel: 013-3031160
Email: ste...@iqnomy.com

 


Re: Adding Options to Create Statements...

2016-03-31 Thread James Carman
No thoughts? Would an upgrade of the driver "fix" this?

On Wed, Mar 30, 2016 at 10:42 AM James Carman 
wrote:

> I am trying to perform the following operation:
>
> public Create createCreate() {
>   Create create = SchemaBuilder.createTable("foo")
>       .addPartitionColumn("bar", varchar())
>       .addClusteringColumn("baz", varchar());
>   if (descending) {
>     create.withOptions().clusteringOrder("baz", Direction.DESC);
>   }
>   return create;
> }
>
> I don't want to have to return the Create.Options object from this method
> (as I may need to add other columns).  Is there a way to have the options
> "decorate" the Create directly without having to return the Create.Options?
>
>


Re: Upgrade cassandra from 2.1.9 to 3.x?

2016-03-31 Thread John Wong
Even if you can upgrade Cassandra straight from 1.2 to 3.X, also consider
driver compatibility.

On Thu, Mar 31, 2016 at 2:43 PM, Tony Anecito  wrote:

> I would also like to know.
>
> Thanks!
>
>
> On Thursday, March 31, 2016 6:14 AM, Steven Choo 
> wrote:
>
>
> Hi,
>
> Is it possible to update Cassandra from 2.1.9 to 3.x (e.g. 3.4) in one
> step?
>
> The info about the tick-tock release schedule does not say anything
> specific about it and the documentation on http://docs.datastax.com is
> separated by 3.0 and 3.x.
> Even though the info on
> https://github.com/apache/cassandra/blob/trunk/NEWS.txt is clear about
> updating from 2.1.9 to 3.0 and does not say anything specifically about the
> tick-tock versions, I want to be sure its possible due to the earlier
> upgrade requirements when upgrading between minor versions.
>
> PS. I have seen
> http://www.mail-archive.com/user%40cassandra.apache.org/msg46632.html but
> I`m more interested in, if its possible at all.
>
> Best regards,
>
> Steven Choo
>
>
> La VistaDuna
> Bijsterveldenlaan 5
> 5045 ZZ Tilburg
> Netherlands
> www.iqnomy.com
>
> Tel:013-3031160
> Email:   ste...@iqnomy.com 
>
>
>


RE: Speeding up "nodetool rebuild"

2016-03-31 Thread Anubhav Kale
Thanks. Is there any way to determine that a rebuild is complete?

Based on the following block in StorageService.java, it's not logged. So is
there any other way to check besides watching data size through nodetool
status?

finally
{
// rebuild is done (successfully or not)
isRebuilding.set(false);
}
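
Since completion isn't logged, one option is to poll `nodetool netstats` and
watch for the rebuild's streaming sessions to disappear. A rough sketch; the
sample strings below are made up for illustration, and exact netstats
formatting varies by version:

```python
import subprocess

def rebuild_streams_active(netstats_output: str) -> bool:
    """Return True if any streaming plan in the netstats text is a rebuild."""
    return any("rebuild" in line.lower() for line in netstats_output.splitlines())

def check_local_node() -> bool:
    # Ask the local node for its current streaming status.
    out = subprocess.run(["nodetool", "netstats"],
                         capture_output=True, text=True).stdout
    return rebuild_streams_active(out)

# Hypothetical netstats snippets, for illustration only:
DURING = "Mode: NORMAL\nRebuild 5ce77940-...\n    /10.0.0.2\n"
AFTER = "Mode: NORMAL\nNot sending any streams.\n"
print(rebuild_streams_active(DURING), rebuild_streams_active(AFTER))  # True False
```

Polling this on the rebuilding node until it returns False is cruder than a
log line, but it avoids guessing from data size alone.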


-Original Message-
From: Eric Evans [mailto:eev...@wikimedia.org] 
Sent: Thursday, March 31, 2016 9:50 AM
To: user@cassandra.apache.org
Subject: Re: Speeding up "nodetool rebuild"

On Wed, Mar 30, 2016 at 3:44 PM, Anubhav Kale  
wrote:
> Any other ways to make the “rebuild” faster ?

TL;DR add more nodes

If you're encountering a per-stream bottleneck (easy to do if using 
compression), then having a higher node count will translate to higher stream 
concurrency, and greater throughput.

Another thing to keep in mind, the streamthroughput value is *outbound*, it 
doesn't matter what you have that set to on the rebuilding/bootstrapping node, 
it *does* matter what it is set to on the nodes that are sending to it 
(https://issues.apache.org/jira/browse/CASSANDRA-11303 aims to introduce an
inbound tunable though).


--
Eric Evans
eev...@wikimedia.org


Re: How many nodes do we require

2016-03-31 Thread Jack Krupansky
Maybe that's a great definition of a modern distributed cluster: each
person (node) has a different notion of priority.

I'll wait for the next user email in which they complain that their data is
"too stable" (missing updates.)

-- Jack Krupansky

On Thu, Mar 31, 2016 at 12:04 PM, Jacques-Henri Berthemet <
jacques-henri.berthe...@genesys.com> wrote:

> You’re right. I meant data integrity; I understand it’s not
> everybody’s priority!
>
>
>
> *--*
>
> *Jacques-Henri Berthemet*
>
>
>
> *From:* Jonathan Haddad [mailto:j...@jonhaddad.com]
> *Sent:* jeudi 31 mars 2016 17:48
>
> *To:* user@cassandra.apache.org
> *Subject:* Re: How many nodes do we require
>
>
>
> Losing a write is very different from having a fragile cluster.  A fragile
> cluster implies that whole thing will fall apart, that it breaks easily.
> Writing at CL=ONE gives you a pretty damn stable cluster at the potential
> risk of losing a write that hasn't replicated (but has been ack'ed) which
> for a lot of people is preferable to downtime.  CL=ONE gives you the *most
> stable* cluster you can have.
>
> On Tue, Mar 29, 2016 at 12:57 AM Jacques-Henri Berthemet <
> jacques-henri.berthe...@genesys.com> wrote:
>
> Because if you lose a node you have chances to lose some data forever if
> it was not yet replicated.
>
>
>
> *--*
>
> *Jacques-Henri Berthemet*
>
>
>
> *From:* Jonathan Haddad [mailto:j...@jonhaddad.com]
> *Sent:* vendredi 25 mars 2016 19:37
>
>
> *To:* user@cassandra.apache.org
> *Subject:* Re: How many nodes do we require
>
>
>
> Why would using CL-ONE make your cluster fragile? This isn't obvious to
> me. It's the most practical setting for high availability, which very much
> says "not fragile".
>
> On Fri, Mar 25, 2016 at 10:44 AM Jacques-Henri Berthemet <
> jacques-henri.berthe...@genesys.com> wrote:
>
> I found this calculator very convenient:
> http://www.ecyrd.com/cassandracalculator/
>
> Regardless of your other DCs you need RF=3 if you write at LOCAL_QUORUM,
> RF=2 if you write/read at ONE.
>
> Obviously using ONE as CL makes your cluster very fragile.
> --
> Jacques-Henri Berthemet
>
>
> -Original Message-
> From: Rakesh Kumar [mailto:rakeshkumar46...@gmail.com]
> Sent: vendredi 25 mars 2016 18:14
> To: user@cassandra.apache.org
> Subject: Re: How many nodes do we require
>
> On Fri, Mar 25, 2016 at 11:45 AM, Jack Krupansky
>  wrote:
> > It depends on how much data you have. A single node can store a lot of
> data,
> > but the more data you have the longer a repair or node replacement will
> > take. How long can you tolerate for a full repair or node replacement?
>
> At this time, for a foreseeable future, size of data will not be
> significant. So we can safely disregard the above as a decision
> factor.
>
> >
> > Generally, RF=3 is both sufficient and recommended.
>
> Are you referring to SimpleStrategy replication with RF=3
> or NetworkTopologyStrategy with RF=3?
>
>
> taken from:
>
>
> https://docs.datastax.com/en/cassandra/2.0/cassandra/architecture/architectureDataDistributeReplication_c.html
>
> "
> Three replicas in each data center: This configuration tolerates
> either the failure of a one node per replication group at a strong
> consistency level of LOCAL_QUORUM or multiple node failures per data
> center using consistency level ONE."
>
> In our case, with only 3 nodes in each DC, wouldn't RF=3 effectively
> mean ALL?
>
> I will state our requirement clearly:
>
> If we are going with six nodes (3 in each DC), we should be able to
> write even with a loss of one DC and loss of one node of the surviving
> DC. I am open to hearing what compromise we have to do with the reads
> during the time a DC is down. For us write is critical, more than
> reads.
>
> Maybe this is not possible with 6 nodes and requires more. Please advise.
>
>
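
A quick way to sanity-check these scenarios is the quorum arithmetic Cassandra
uses: a quorum is a majority of the replicas counted by the consistency level.
A sketch for the 6-node, RF=3-per-DC layout discussed above (standard quorum
semantics, not anything specific to this cluster):

```python
def quorum(replicas: int) -> int:
    # Cassandra's quorum: a strict majority of the counted replicas.
    return replicas // 2 + 1

rf_per_dc = 3
dcs = 2

# LOCAL_QUORUM counts only the local DC's replicas: 2 of 3, so RF=3 in a
# DC is NOT the same as ALL -- one local node can still be down.
assert quorum(rf_per_dc) == 2

# Global QUORUM counts replicas across all DCs: 4 of 6. Losing a whole DC
# (3 replicas) plus one more node leaves only 2 replicas, so global QUORUM
# writes fail, while LOCAL_QUORUM in the surviving DC (2 of 3) still works.
total_replicas = rf_per_dc * dcs
assert quorum(total_replicas) == 4
surviving = total_replicas - rf_per_dc - 1  # one DC down, plus one more node
print(surviving >= quorum(total_replicas))  # False
```

That matches the trade-off in the thread: with 6 nodes, writing through a DC
outage plus a node failure means writing at LOCAL_QUORUM (or ONE), not at a
global QUORUM.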


Re: Thrift composite partition key to cql migration

2016-03-31 Thread Tyler Hobbs
Also, can you paste the results of the relevant portions of "SELECT * FROM
system.schema_columns" and "SELECT * FROM system.schema_columnfamilies"?

On Thu, Mar 31, 2016 at 2:35 PM, Tyler Hobbs  wrote:

> In the Thrift schema, is the key_validation_class actually set to
> CompositeType(UTF8Type, UTF8Type), or is it just BytesType?  What Cassandra
> version?
>
> On Wed, Mar 30, 2016 at 4:44 PM, Jan Kesten  wrote:
>
>> Hi,
>>
>> while migrating the remainder of the thrift operations in my application I
>> came across a point where I can't find a good hint.
>>
>> In our old code we used a composite with two strings as row / partition
>> key and a similar composite as column key like this:
>>
>> public Composite rowKey() {
>> final Composite composite = new Composite();
>> composite.addComponent(key1, StringSerializer.get());
>> composite.addComponent(key2, StringSerializer.get());
>> return composite;
>> }
>>
>> public Composite columnKey() {
>> final Composite composite = new Composite();
>> composite.addComponent(key3, StringSerializer.get());
>> composite.addComponent(key4, StringSerializer.get());
>> return composite;
>> }
>>
>> In cql this columnfamiliy looks like this:
>>
>> CREATE TABLE foo.bar (
>> key blob,
>> column1 text,
>> column2 text,
>> value blob,
>> PRIMARY KEY (key, column1, column2)
>> )
>>
>> For the columns key3 and key4 became column1 and column2 - but the old
>> rowkey is presented as blob (I can put it into a hex editor and see that
>> key1 and key2 values are in there).
>>
>> Any pointers on how to handle this, or is this a known issue? I am now using
>> the DataStax Java driver for CQL; the old connector used thrift. Is there any
>> way to get key1 and key2 back apart from completely rewriting the table? This
>> is what I had expected it to be:
>>
>> CREATE TABLE foo.bar (
>> key1 text,
>> key2 text,
>> column1 text,
>> column2 text,
>> value blob,
>> PRIMARY KEY ((key1, key2), column1, column2)
>> )
>>
>> Cheers,
>> Jan
>>
>
>
>
> --
> Tyler Hobbs
> DataStax 
>



-- 
Tyler Hobbs
DataStax 


Re: Thrift composite partition key to cql migration

2016-03-31 Thread Tyler Hobbs
In the Thrift schema, is the key_validation_class actually set to
CompositeType(UTF8Type, UTF8Type), or is it just BytesType?  What Cassandra
version?

On Wed, Mar 30, 2016 at 4:44 PM, Jan Kesten  wrote:

> Hi,
>
> while migrating the remainder of the thrift operations in my application I
> came across a point where I can't find a good hint.
>
> In our old code we used a composite with two strings as row / partition
> key and a similar composite as column key like this:
>
> public Composite rowKey() {
> final Composite composite = new Composite();
> composite.addComponent(key1, StringSerializer.get());
> composite.addComponent(key2, StringSerializer.get());
> return composite;
> }
>
> public Composite columnKey() {
> final Composite composite = new Composite();
> composite.addComponent(key3, StringSerializer.get());
> composite.addComponent(key4, StringSerializer.get());
> return composite;
> }
>
> In cql this columnfamiliy looks like this:
>
> CREATE TABLE foo.bar (
> key blob,
> column1 text,
> column2 text,
> value blob,
> PRIMARY KEY (key, column1, column2)
> )
>
> For the columns key3 and key4 became column1 and column2 - but the old
> rowkey is presented as blob (I can put it into a hex editor and see that
> key1 and key2 values are in there).
>
> Any pointers on how to handle this, or is this a known issue? I am now using
> the DataStax Java driver for CQL; the old connector used thrift. Is there any
> way to get key1 and key2 back apart from completely rewriting the table? This
> is what I had expected it to be:
>
> CREATE TABLE foo.bar (
> key1 text,
> key2 text,
> column1 text,
> column2 text,
> value blob,
> PRIMARY KEY ((key1, key2), column1, column2)
> )
>
> Cheers,
> Jan
>
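
For reference, Thrift CompositeType serializes each component as a 2-byte
big-endian length, the component bytes, then a one-byte end-of-component
marker. Assuming the key really is CompositeType(UTF8Type, UTF8Type), the
blob can be split back into its two strings client-side; a sketch:

```python
import struct

def decode_composite(blob: bytes) -> list[str]:
    """Split a CompositeType-encoded partition key into its components."""
    parts, i = [], 0
    while i < len(blob):
        (length,) = struct.unpack_from(">H", blob, i)      # 2-byte length
        i += 2
        parts.append(blob[i:i + length].decode("utf-8"))   # component bytes
        i += length + 1                                    # skip end-of-component byte
    return parts

def encode_composite(*components: str) -> bytes:
    """Inverse of decode_composite, for round-trip testing."""
    out = b""
    for c in components:
        raw = c.encode("utf-8")
        out += struct.pack(">H", len(raw)) + raw + b"\x00"
    return out

blob = encode_composite("key1-value", "key2-value")
print(decode_composite(blob))  # ['key1-value', 'key2-value']
```

This only recovers the values for display or re-insertion; changing the
table's CQL-visible key layout still requires rewriting into a table with
PRIMARY KEY ((key1, key2), ...).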



-- 
Tyler Hobbs
DataStax 


Re: Inconsistent query results and node state

2016-03-31 Thread Tyler Hobbs
On Thu, Mar 31, 2016 at 11:53 AM, Jason Kania  wrote:

>
> To me it just seems like the timestamp column value is sometimes not being
> set somewhere in the pipeline and the result is the epoch 0 value.
>

I agree, especially since you can't directly query this row and that
timestamp doesn't fit in the normal ordering.


>
> Thoughts on how to proceed?
>

Please open a ticket at https://issues.apache.org/jira/browse/CASSANDRA and
include your schema and queries.  If possible, it would also be extremely
helpful if you can upload the sstables for that table.


-- 
Tyler Hobbs
DataStax 


Re: Upgrade cassandra from 2.1.9 to 3.x?

2016-03-31 Thread Tony Anecito
I would also like to know.
Thanks!
 

On Thursday, March 31, 2016 6:14 AM, Steven Choo  wrote:
 

 Hi,
Is it possible to update Cassandra from 2.1.9 to 3.x (e.g. 3.4) in one step?
The info about the tick-tock release schedule does not say anything specific 
about it and the documentation on http://docs.datastax.com is separated by 3.0 
and 3.x.Even though the info on 
https://github.com/apache/cassandra/blob/trunk/NEWS.txt is clear about updating 
from 2.1.9 to 3.0 and does not say anything specifically about the tick-tock 
versions, I want to be sure its possible due to the earlier upgrade 
requirements when upgrading between minor versions.
PS. I have seen 
http://www.mail-archive.com/user%40cassandra.apache.org/msg46632.html but I`m 
more interested in, if its possible at all.
Best regards, Steven Choo  La VistaDunaBijsterveldenlaan 5
5045 ZZ TilburgNetherlandswww.iqnomy.com Tel:        013-3031160Email:   
ste...@iqnomy.com 

  

Re: Consistency Level (QUORUM vs LOCAL_QUORUM)

2016-03-31 Thread Robert Coli
On Thu, Mar 31, 2016 at 4:35 AM, Alain RODRIGUEZ  wrote:

> My understanding is that using RF 3 and LOCAL_QUORUM for both reads and writes
> will provide strong consistency and high availability. One node can go
> down without lowering the consistency. Or RF = 5, Quorum = 3,
> allowing 2 nodes down if you need more availability / redundancy (never saw
> that in use so far).
>

I agree that if one "actually" cared about TOTAL AVAILABILITY one would run
RF=5.

In that case, one can survive a "permanent" downtime (node hw crashes) and
a "temporary" downtime (momentary network partition of node) and still
serve at QUORUM.

I also agree that I've never seen anyone actually do this, probably because
they don't "actually" care... :D

=Rob


Re: NTP Synchronization Setup Changes

2016-03-31 Thread Eric Evans
On Wed, Mar 30, 2016 at 8:07 PM, Mukil Kesavan
 wrote:
> Are there any issues if this causes a huge time correction on the cassandra
> cluster? I know that NTP gradually corrects the time on all the servers. I
> just wanted to understand if there were any corner cases that will cause us
> to lose data/schema updates when this happens. In particular, we seem to be
> having some issues around missing secondary indices at the moment (not all
> but some).

As a thought experiment, imagine every scenario where it matters to
have one write occur after another (an update followed by a delete is
a good example).  Now imagine having your clock yanked backward to
correct for drift between the first such operation and the second.

I would strongly recommend you come up with a stable NTP setup.
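
The thought experiment above can be made concrete: Cassandra resolves
conflicting cells by last-write-wins on the write timestamp, so a backward
clock step lets an older operation shadow a newer one. A toy sketch of that
resolution rule, with made-up timestamps:

```python
# Last-write-wins cell resolution, as Cassandra applies it per column.
# cells: list of (write_timestamp_micros, value); None models a tombstone.
def resolve(cells):
    return max(cells, key=lambda c: c[0])[1]

# Normal clocks: the delete (issued second) correctly wins over the update.
assert resolve([(1_000_000, "updated"), (1_000_050, None)]) is None

# Clock yanked backward between the two operations: the delete is stamped
# *earlier* than the update it was meant to remove, so the update survives.
print(resolve([(1_000_000, "updated"), (999_000, None)]))  # updated
```

The same mechanism explains lost secondary-index or schema updates: any pair
of ordered mutations can be silently reordered by a clock step.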


-- 
Eric Evans
eev...@wikimedia.org


Re: Inconsistent query results and node state

2016-03-31 Thread Jason Kania
Thanks for responding. The problems that we are having are in Cassandra 3.0.3
and 3.0.4. We had upgraded to see if the problem went away.

The values have been out of sync this way for some time, and we cannot get a
row with the 1969 timestamp in any query that directly queries on the
timestamp. The 1969-12-31 19:00 value comes back inconsistently in range
queries but seems to be tied to the 192.168.10.9 node.
We tried the writetime function on time in the query, but it is not allowed as
the time column is part of the primary key. Instead we used it on an additional
field that is written at the same time (classId):

subscriberId  sensorUnitId  sensorId  time              writetime(classId)
JASKAN        0             0         2015-05-24 02:09  1458178461272000
JASKAN        0             0         1969-12-31 19:00  1458178801214000
JASKAN        0             0         2016-01-21 02:10  1458178801221000
JASKAN        0             0         2016-01-21 02:10  1458178801226000
JASKAN        0             0         2016-01-21 02:10  1458178801231000
JASKAN        0             0         2016-01-21 02:11  1458178801235000
JASKAN        0             0         2016-01-21 02:22  1458178801241000
JASKAN        0             0         2016-01-21 02:22  1458178801247000
JASKAN        0             0         2016-01-21 02:22  1458178801252000
JASKAN        0             0         2016-01-21 02:22  1458178801258000

Based on the other column values in the row, we confirmed that the row showing
up with the 1969-12-31 19:00 timestamp is actually associated with the
following timestamp:

subscriberId  sensorUnitId  sensorId  time              writetime(classId)
JASKAN        0             0         2016-01-21 02:09  1458178801214000

The 2016-01-21 02:09 timestamp is always present on all nodes when queried
directly, based on tracing.

To me it just seems like the timestamp column value is sometimes not being set
somewhere in the pipeline and the result is the epoch 0 value.
Thoughts on how to proceed?
Thanks,

Jason
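
As an aside for anyone correlating the writetime values above with wall-clock
time: CQL writetime() returns microseconds since the Unix epoch, so the
conversion is direct (illustrative helper):

```python
from datetime import datetime, timezone

def writetime_to_utc(wt_micros: int) -> datetime:
    # writetime() is microseconds since the Unix epoch; use integer
    # arithmetic to avoid float rounding of the microsecond part.
    secs, micros = divmod(wt_micros, 1_000_000)
    return datetime.fromtimestamp(secs, tz=timezone.utc).replace(microsecond=micros)

# The writetime on the anomalous 1969-12-31 row above:
print(writetime_to_utc(1458178801214000).isoformat())
# 2016-03-17T01:40:01.214000+00:00
```

That places all of those writes within the same second in March 2016, i.e.
they were (re)written together, long after the event times they carry.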

  From: Tyler Hobbs 
 To: user@cassandra.apache.org 
 Sent: Wednesday, March 30, 2016 11:31 AM
 Subject: Re: Inconsistent query results and node state
   

org.apache.cassandra.service.DigestMismatchException: Mismatch for key 
DecoratedKey(-4908797801227889951, 4a41534b414e) 
(6a6c8ab013d7757e702af50cbdae045c vs 2ece61a01b2a640ac10509f4c49ae6fb)

That key matches the row you mentioned, so it seems like all of the replicas 
should have converged on the same value for that row.  Do you consistently get 
the 1969-12-31 19:00 timestamp back now?  If not, try selecting both "time" and 
"writetime(time)" from that row and see what write timestamps each of the 
values have.

The ArrayIndexOutOfBoundsException in response to nodetool compact looks like a 
bug.  What version of Cassandra are you running?
 
On Wed, Mar 30, 2016 at 9:59 AM, Kai Wang  wrote:

Do you have NTP setup on all nodes?

On Tue, Mar 29, 2016 at 11:48 PM, Jason Kania  wrote:

We have encountered a query inconsistency problem wherein the following query 
returns different results sporadically with invalid values for a timestamp 
field looking like the field is uninitialized (a zero timestamp) in the query 
results.

Attempts to repair and compact have not changed the results.

select "subscriberId","sensorUnitId","sensorId","time" from 
"sensorReadingIndex" where "subscriberId"='JASKAN' AND "sensorUnitId"=0 AND 
"sensorId"=0 ORDER BY "time" LIMIT 10;

Invalid Query Results
subscriberId    sensorUnitId    sensorId    time
JASKAN    0    0    2015-05-24 2:09
JASKAN    0    0    1969-12-31 19:00
JASKAN    0    0    2016-01-21 2:10
JASKAN    0    0    2016-01-21 2:10
JASKAN    0    0    2016-01-21 2:10
JASKAN    0    0    2016-01-21 2:11
JASKAN    0    0    2016-01-21 2:22
JASKAN    0    0    2016-01-21 2:22
JASKAN    0    0    2016-01-21 2:22
JASKAN    0    0    2016-01-21 2:22

Valid Query Results
subscriberId    sensorUnitId    sensorId    time
JASKAN    0    0    2015-05-24 2:09
JASKAN    0    0    2015-05-24 2:09
JASKAN    0    0    2015-05-24 2:10
JASKAN    0    0    2015-05-24 2:10
JASKAN    0    0    2015-05-24 2:10
JASKAN    0    0    2015-05-24 2:10
JASKAN    0    0    2015-05-24 2:11
JASKAN    0    0    2015-05-24 2:13
JASKAN    0    0    2015-05-24 2:13
JASKAN    0    0    2015-05-24 2:14

We have confirmed that the 1969-12-31 timestamp is not within the data, based
on running a number of queries, so it looks like the invalid timestamp value is
generated by the query. The query below returns no row.

select * from "sensorReadingIndex" where "subscriberId"='JASKAN' AND 
"sensorUnitId"=0 AND "sensorId"=0 AND time='1969-12-31 19:00:00-0500';

No logs are coming out but the following was observed intermittently in the 
tracing output, but not 

Re: Speeding up "nodetool rebuild"

2016-03-31 Thread Eric Evans
On Wed, Mar 30, 2016 at 3:44 PM, Anubhav Kale
 wrote:
> Any other ways to make the “rebuild” faster ?

TL;DR add more nodes

If you're encountering a per-stream bottleneck (easy to do if using
compression), then having a higher node count will translate to higher
stream concurrency, and greater throughput.

Another thing to keep in mind, the streamthroughput value is
*outbound*, it doesn't matter what you have that set to on the
rebuilding/bootstrapping node, it *does* matter what it is set to on
the nodes that are sending to it
(https://issues.apache.org/jira/browse/CASSANDRA-11303 aims to
introduce an inbound tunable though).
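
A back-of-the-envelope estimate helps tell whether the outbound throughput cap
or the stream count is the limiting factor. A sketch with made-up numbers; the
real per-sender cap comes from `nodetool getstreamthroughput` on the *sending*
nodes, per the point above:

```python
def rebuild_eta_hours(data_gb: float, senders: int,
                      per_node_limit_megabits: float) -> float:
    """Rough lower bound on rebuild time: total data over the aggregate
    outbound throughput of the sending nodes (ignores compression CPU,
    which is often the real per-stream bottleneck)."""
    aggregate_mb_per_s = senders * per_node_limit_megabits / 8  # Mbit/s -> MB/s
    return (data_gb * 1024) / aggregate_mb_per_s / 3600

# 2 TB to stream from 10 senders, each capped at 200 Mbit/s outbound:
print(round(rebuild_eta_hours(2048, 10, 200), 1))  # 2.3
```

The `senders` factor is why "add more nodes" works: more senders means more
concurrent streams and more aggregate outbound budget for the same caps.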


-- 
Eric Evans
eev...@wikimedia.org


Re: Inconsistent query results and node state

2016-03-31 Thread Jason Kania
Thanks for the response.

All nodes are using NTP.
Thanks,
Jason

  From: Kai Wang 
 To: user@cassandra.apache.org; Jason Kania  
 Sent: Wednesday, March 30, 2016 10:59 AM
 Subject: Re: Inconsistent query results and node state
   
Do you have NTP setup on all nodes?

On Tue, Mar 29, 2016 at 11:48 PM, Jason Kania  wrote:

We have encountered a query inconsistency problem wherein the following query 
returns different results sporadically with invalid values for a timestamp 
field looking like the field is uninitialized (a zero timestamp) in the query 
results.

Attempts to repair and compact have not changed the results.

select "subscriberId","sensorUnitId","sensorId","time" from 
"sensorReadingIndex" where "subscriberId"='JASKAN' AND "sensorUnitId"=0 AND 
"sensorId"=0 ORDER BY "time" LIMIT 10;

Invalid Query Results
subscriberId    sensorUnitId    sensorId    time
JASKAN    0    0    2015-05-24 2:09
JASKAN    0    0    1969-12-31 19:00
JASKAN    0    0    2016-01-21 2:10
JASKAN    0    0    2016-01-21 2:10
JASKAN    0    0    2016-01-21 2:10
JASKAN    0    0    2016-01-21 2:11
JASKAN    0    0    2016-01-21 2:22
JASKAN    0    0    2016-01-21 2:22
JASKAN    0    0    2016-01-21 2:22
JASKAN    0    0    2016-01-21 2:22

Valid Query Results
subscriberId    sensorUnitId    sensorId    time
JASKAN    0    0    2015-05-24 2:09
JASKAN    0    0    2015-05-24 2:09
JASKAN    0    0    2015-05-24 2:10
JASKAN    0    0    2015-05-24 2:10
JASKAN    0    0    2015-05-24 2:10
JASKAN    0    0    2015-05-24 2:10
JASKAN    0    0    2015-05-24 2:11
JASKAN    0    0    2015-05-24 2:13
JASKAN    0    0    2015-05-24 2:13
JASKAN    0    0    2015-05-24 2:14

We have confirmed that the 1969-12-31 timestamp is not within the data, based
on running a number of queries, so it looks like the invalid timestamp value is
generated by the query. The query below returns no row.

select * from "sensorReadingIndex" where "subscriberId"='JASKAN' AND 
"sensorUnitId"=0 AND "sensorId"=0 AND time='1969-12-31 19:00:00-0500';

No logs are coming out but the following was observed intermittently in the 
tracing output, but not correlated to the invalid query results:

 Digest mismatch: org.apache.cassandra.service.DigestMismatchException: 
Mismatch for key DecoratedKey(-7563144029910940626, 
00064a41534b414e040400) 
(be22d379c18f75c2f51dd6942d2f9356 vs da4e95d571b41303b908e0c5c3fff7ba) 
[ReadRepairStage:3179] | 2016-03-29 23:12:35.025000 | 192.168.10.10 |
An error from the debug log that might be related is:
org.apache.cassandra.service.DigestMismatchException: Mismatch for key 
DecoratedKey(-4908797801227889951, 4a41534b414e) 
(6a6c8ab013d7757e702af50cbdae045c vs 2ece61a01b2a640ac10509f4c49ae6fb)
    at 
org.apache.cassandra.service.DigestResolver.resolve(DigestResolver.java:85) 
~[apache-cassandra-3.0.3.jar:3.0.3]
    at 
org.apache.cassandra.service.ReadCallback$AsyncRepairRunner.run(ReadCallback.java:225)
 ~[apache-cassandra-3.0.3.jar:3.0.3]
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
[na:1.8.0_74]
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
[na:1.8.0_74]
    at java.lang.Thread.run(Thread.java:745) [na:1.8.0_74]

The tracing files are attached and seem to show that, in the failed case, 
content is skipped because of tombstones, if we understand it correctly. This 
could be an inconsistency problem on 192.168.10.9. Unfortunately, attempts to 
compact on 192.168.10.9 only give the following error without any stack trace 
detail, and the problem is not fixed by repair.

root@cutthroat:/usr/local/bin/analyzer/bin# nodetool compact
error: null
-- StackTrace --
java.lang.ArrayIndexOutOfBoundsException
Any suggestions on how to fix or what to search for would be much appreciated.
Thanks,
Jason







  

RE: How many nodes do we require

2016-03-31 Thread Jacques-Henri Berthemet
You’re right. I meant about data integrity, I understand it’s not everybody’s 
priority!

--
Jacques-Henri Berthemet

From: Jonathan Haddad [mailto:j...@jonhaddad.com]
Sent: jeudi 31 mars 2016 17:48
To: user@cassandra.apache.org
Subject: Re: How many nodes do we require

Losing a write is very different from having a fragile cluster.  A fragile 
cluster implies that whole thing will fall apart, that it breaks easily.  
Writing at CL=ONE gives you a pretty damn stable cluster at the potential risk 
of losing a write that hasn't replicated (but has been ack'ed) which for a lot 
of people is preferable to downtime.  CL=ONE gives you the *most stable* 
cluster you can have.
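The disagreement reduces to the standard overlap rule: a read is guaranteed to observe the latest acknowledged write only when R + W > RF. A quick sketch of the arithmetic (my own illustration, not from the thread):

```python
def overlapping(r, w, rf):
    # Reads see the latest acknowledged write when the read set and
    # write set must share at least one replica: R + W > RF.
    return r + w > rf

RF = 3
print(overlapping(r=2, w=2, rf=RF))  # True  -> QUORUM reads + QUORUM writes
print(overlapping(r=1, w=1, rf=RF))  # False -> ONE/ONE can return stale data
print(overlapping(r=3, w=1, rf=RF))  # True  -> write at ONE, read at ALL
```

Writing at ONE trades that overlap guarantee for availability, which is exactly the choice being debated above.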
On Tue, Mar 29, 2016 at 12:57 AM Jacques-Henri Berthemet wrote:
Because if you lose a node, you risk losing some data forever if it was 
not yet replicated.

--
Jacques-Henri Berthemet

From: Jonathan Haddad [mailto:j...@jonhaddad.com]
Sent: vendredi 25 mars 2016 19:37

To: user@cassandra.apache.org
Subject: Re: How many nodes do we require

Why would using CL=ONE make your cluster fragile? This isn't obvious to me. 
It's the most practical setting for high availability, which very much says 
"not fragile".
On Fri, Mar 25, 2016 at 10:44 AM Jacques-Henri Berthemet wrote:
I found this calculator very convenient:
http://www.ecyrd.com/cassandracalculator/

Regardless of your other DCs you need RF=3 if you write at LOCAL_QUORUM, RF=2 
if you write/read at ONE.

Obviously using ONE as CL makes your cluster very fragile.
--
Jacques-Henri Berthemet


-Original Message-
From: Rakesh Kumar [mailto:rakeshkumar46...@gmail.com]
Sent: vendredi 25 mars 2016 18:14
To: user@cassandra.apache.org
Subject: Re: How many nodes do we require

On Fri, Mar 25, 2016 at 11:45 AM, Jack Krupansky wrote:
> It depends on how much data you have. A single node can store a lot of data,
> but the more data you have the longer a repair or node replacement will
> take. How long can you tolerate for a full repair or node replacement?

At this time, for a foreseeable future, size of data will not be
significant. So we can safely disregard the above as a decision
factor.

>
> Generally, RF=3 is both sufficient and recommended.

Are you suggesting SimpleStrategy with RF=3,
or NetworkTopologyStrategy with RF=3?


taken from:

https://docs.datastax.com/en/cassandra/2.0/cassandra/architecture/architectureDataDistributeReplication_c.html

"
Three replicas in each data center: This configuration tolerates
either the failure of one node per replication group at a strong
consistency level of LOCAL_QUORUM or multiple node failures per data
center using consistency level ONE."

In our case, with only 3 nodes in each DC, wouldn't RF=3 effectively mean ALL?

I will state our requirement clearly:

If we are going with six nodes (3 in each DC), we should be able to
write even with a loss of one DC and loss of one node of the surviving
DC. I am open to hearing what compromise we have to do with the reads
during the time a DC is down. For us write is critical, more than
reads.

Maybe this is not possible with 6 nodes and requires more. Please advise.
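On the RF question: with RF=3 per DC, LOCAL_QUORUM does not mean ALL, since a quorum is only a majority of the local replicas. A sketch of the arithmetic for the stated failure mode, one DC down plus one node in the surviving DC (my own illustration, not from the thread):

```python
def quorum(rf):
    # A quorum is a strict majority of replicas: floor(rf / 2) + 1.
    return rf // 2 + 1

rf_per_dc = 3
alive_local = rf_per_dc - 1  # one node lost in the surviving DC
alive_total = alive_local    # the other DC's 3 replicas are all down

# LOCAL_QUORUM counts only the local DC's replicas: needs 2 of 3.
print(quorum(rf_per_dc), alive_local >= quorum(rf_per_dc))          # 2 True

# Plain QUORUM counts replicas across both DCs: needs 4 of 6.
print(quorum(2 * rf_per_dc), alive_total >= quorum(2 * rf_per_dc))  # 4 False
```

So with RF=3 in each DC, writes at LOCAL_QUORUM survive the described failure, while a global QUORUM (or ALL) would not.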


Re: How many nodes do we require

2016-03-31 Thread Jonathan Haddad
Losing a write is very different from having a fragile cluster.  A fragile
cluster implies that whole thing will fall apart, that it breaks easily.
Writing at CL=ONE gives you a pretty damn stable cluster at the potential
risk of losing a write that hasn't replicated (but has been ack'ed) which
for a lot of people is preferable to downtime.  CL=ONE gives you the *most
stable* cluster you can have.

On Tue, Mar 29, 2016 at 12:57 AM Jacques-Henri Berthemet <
jacques-henri.berthe...@genesys.com> wrote:

> Because if you lose a node, you risk losing some data forever if
> it was not yet replicated.
>
>
>
> *--*
>
> *Jacques-Henri Berthemet*
>
>
>
> *From:* Jonathan Haddad [mailto:j...@jonhaddad.com]
> *Sent:* vendredi 25 mars 2016 19:37
>
>
> *To:* user@cassandra.apache.org
> *Subject:* Re: How many nodes do we require
>
>
>
> Why would using CL=ONE make your cluster fragile? This isn't obvious to
> me. It's the most practical setting for high availability, which very much
> says "not fragile".
>
> On Fri, Mar 25, 2016 at 10:44 AM Jacques-Henri Berthemet <
> jacques-henri.berthe...@genesys.com> wrote:
>
> I found this calculator very convenient:
> http://www.ecyrd.com/cassandracalculator/
>
> Regardless of your other DCs you need RF=3 if you write at LOCAL_QUORUM,
> RF=2 if you write/read at ONE.
>
> Obviously using ONE as CL makes your cluster very fragile.
> --
> Jacques-Henri Berthemet
>
>
> -Original Message-
> From: Rakesh Kumar [mailto:rakeshkumar46...@gmail.com]
> Sent: vendredi 25 mars 2016 18:14
> To: user@cassandra.apache.org
> Subject: Re: How many nodes do we require
>
> On Fri, Mar 25, 2016 at 11:45 AM, Jack Krupansky wrote:
> > It depends on how much data you have. A single node can store a lot of
> data,
> > but the more data you have the longer a repair or node replacement will
> > take. How long can you tolerate for a full repair or node replacement?
>
> At this time, for a foreseeable future, size of data will not be
> significant. So we can safely disregard the above as a decision
> factor.
>
> >
> > Generally, RF=3 is both sufficient and recommended.
>
> Are you suggesting SimpleStrategy with RF=3,
> or NetworkTopologyStrategy with RF=3?
>
>
> taken from:
>
>
> https://docs.datastax.com/en/cassandra/2.0/cassandra/architecture/architectureDataDistributeReplication_c.html
>
> "
> Three replicas in each data center: This configuration tolerates
> either the failure of a one node per replication group at a strong
> consistency level of LOCAL_QUORUM or multiple node failures per data
> center using consistency level ONE."
>
> In our case, with only 3 nodes in each DC, wouldn't RF=3 effectively
> mean ALL?
>
> I will state our requirement clearly:
>
> If we are going with six nodes (3 in each DC), we should be able to
> write even with a loss of one DC and loss of one node of the surviving
> DC. I am open to hearing what compromise we have to do with the reads
> during the time a DC is down. For us write is critical, more than
> reads.
>
> Maybe this is not possible with 6 nodes and requires more. Please advise.
>
>


Re: auto_bootstrap when a node is down

2016-03-31 Thread Carlos Alonso
Mmm, OK, then I think you may need to follow the standard dead-node replacement
procedure:
https://docs.datastax.com/en/cassandra/2.2/cassandra/operations/opsReplaceNode.html
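The key step of that procedure is starting the replacement node with the dead node's address; a hedged config sketch (the IP is a placeholder, not from the thread):

```shell
# Hypothetical one-time JVM option for the NEW node's cassandra-env.sh,
# pointing at the DEAD node's IP (placeholder). Remove it once the node
# has finished joining the ring.
JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=10.0.0.12"
```
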

Cheers!

Carlos Alonso | Software Engineer | @calonso 

On 31 March 2016 at 16:34, Peddi, Praveen  wrote:

> Hi Carlos,
> In our case, old node is dead and is not accessible. So I am not sure if
> we can use rsync in this case.
>
> Praveen
>
>
> From: Carlos Alonso 
> Reply-To: "user@cassandra.apache.org" 
> Date: Thursday, March 31, 2016 at 10:31 AM
> To: "user@cassandra.apache.org" 
> Subject: Re: auto_bootstrap when a node is down
>
> If that's your use case, I've developed a quick disk-based replacement
> procedure.
>
> Basically, all it involves is rsyncing the data from the old node to the
> new node and bringing the new one up as if it were the old one (only the
> IP will change). Step by step details here:
> http://mrcalonso.com/cassandra-instantaneous-in-place-node-replacement/
>
> We've been using it for a while; it works nicely and avoids the time,
> resources and baby-sitting that streaming data across nodes consumes.
>
> Regards
>
> Carlos Alonso | Software Engineer | @calonso 
>
> On 31 March 2016 at 15:26, Peddi, Praveen  wrote:
>
>> Hi Paulo,
>> Thanks a lot for detailed explanation. Our usecase is that, when one node
>> goes down, a new node in the same AZ comes up immediately (5 to 10 mins)
>> and it is safe to assume that no other nodes in another AZ are down at this
>> point of time. So based on your explanation, using
>> -Dcassandra.consistent.rangemovement=false seems like the way to go for our
>> usecase. I will test it with that option.
>>
>> Thanks again.
>>
>> Praveen
>>
>>
>>
>> From: Paulo Motta 
>> Reply-To: "user@cassandra.apache.org" 
>> Date: Wednesday, March 30, 2016 at 10:55 AM
>> To: "user@cassandra.apache.org" 
> Subject: Re: auto_bootstrap when a node is down
>>
>> When you add a node it will take over the range of an existing node, and
>> thus it should stream data from it to maintain consistency. If the existing
>> node is unavailable, the new node may fetch the data from a different
>> replica, which may not have some of the data from the node whose range you
>> are taking, which may break consistency.
>>
>> For example, imagine a ring with nodes A, B and C, RF=3. The row X=1 maps
>> to node A and is replicated in nodes B and C, so the initial arrangement
>> will be:
>>
>> A(X=1), B(X=1) and C(X=1)
>>
>> Node B is down and you write X=2 to A, which replicates the data only to
>> C since B is down (and hinted handoff is disabled). The write succeeds at
>> QUORUM. The new arrangement becomes:
>>
>> A(X=2), B(X=1), C(X=2) (if any of the nodes is down, any read at QUORUM
>> will fetch the correct value of X=2)
>>
>> Now imagine you add a new node D between A and B. If D streams data from
>> A, the new replica group will become:
>>
>> A, D(X=2), B(X=1), C(X=2) (if any of the nodes is down, any read at
>> QUORUM will fetch the correct value of X=2)
>>
>> But if A is down when you bootstrap D and you have
>> -Dcassandra.consistent.rangemovement=false, D may stream data from B, so
>> the new replica group will be:
>>
>> A, D(X=1), B(X=1), C(X=2)
>>
>> Now, if C becomes down, reads at QUORUM will succeed but return the stale
>> value of X=1, so consistency is broken.
>>
>> If you're continuously running repair, have hinted handoff and read
>> repair enabled, the probability of something like this happening will
>> decrease, but it may still happen. If this is not a problem you may use
>> option -Dcassandra.consistent.rangemovement=false to bootstrap a node when
>> another node is down. See CASSANDRA-2434 for more background.
>>
>> 2016-03-30 11:14 GMT-03:00 Peddi, Praveen :
>>
>>> Hello all,
>>> We just upgraded to 2.2.4 (from 2.0.9) and we noticed one issue when new
>>> nodes are added. When we add a new node when no nodes are down in the
>>> cluster, everything works fine but when we add new node while 1 node is
>>> down, I am seeing following error. My understanding was when auto_bootstrap
>>> is enabled, bootstrapping process uses QUORUM consistency so it should work
>>> when one node is down. Is that not correct? Is there a way to add a new
>>> node with bootstrapping, but not using replace address option? We use auto
>>> scaling and new node gets added automatically when one node goes down and
>>> since it's all scripted I can't use replace address in cassandra-env.sh file
>>> as a one-time option.
>>>
>>> One fallback mechanism we could use is to disable auto bootstrap and let
>>> read repairs populate the data over time, but it's not ideal. Is this even a
>>> good alternative to this failure?
>>>
>>> ERROR 20:30:45 Exception encountered during 

Re: auto_bootstrap when a node is down

2016-03-31 Thread Peddi, Praveen
Hi Carlos,
In our case, old node is dead and is not accessible. So I am not sure if we can 
use rsync in this case.

Praveen


From: Carlos Alonso
Reply-To: "user@cassandra.apache.org"
Date: Thursday, March 31, 2016 at 10:31 AM
To: "user@cassandra.apache.org"
Subject: Re: auto_bootstrap when a node is down

If that's your use case I've developed a quick disk based replacement procedure.

Basically, all it involves is rsyncing the data from the old node to the new 
node and bringing the new one up as if it were the old one (only the IP will 
change). Step by step details here: 
http://mrcalonso.com/cassandra-instantaneous-in-place-node-replacement/

We've been using it for a while; it works nicely and avoids the time, resources 
and baby-sitting that streaming data across nodes consumes.

Regards

Carlos Alonso | Software Engineer | @calonso

On 31 March 2016 at 15:26, Peddi, Praveen wrote:
Hi Paulo,
Thanks a lot for detailed explanation. Our usecase is that, when one node goes 
down, a new node in the same AZ comes up immediately (5 to 10 mins) and it is 
safe to assume that no other nodes in another AZ are down at this point of 
time. So based on your explanation, using 
-Dcassandra.consistent.rangemovement=false seems like the way to go for our 
usecase. I will test it with that option.

Thanks again.

Praveen





From: Paulo Motta
Reply-To: "user@cassandra.apache.org"
Date: Wednesday, March 30, 2016 at 10:55 AM
To: "user@cassandra.apache.org"
Subject: Re: auto_bootstrap when a node is down

When you add a node it will take over the range of an existing node, and thus 
it should stream data from it to maintain consistency. If the existing node is 
unavailable, the new node may fetch the data from a different replica, which 
may not have some of the data from the node whose range you are taking, which 
may break consistency.

For example, imagine a ring with nodes A, B and C, RF=3. The row X=1 maps to 
node A and is replicated in nodes B and C, so the initial arrangement will be:

A(X=1), B(X=1) and C(X=1)

Node B is down and you write X=2 to A, which replicates the data only to C 
since B is down (and hinted handoff is disabled). The write succeeds at QUORUM. 
The new arrangement becomes:

A(X=2), B(X=1), C(X=2) (if any of the nodes is down, any read at QUORUM will 
fetch the correct value of X=2)

Now imagine you add a new node D between A and B. If D streams data from A, the 
new replica group will become:

A, D(X=2), B(X=1), C(X=2) (if any of the nodes is down, any read at QUORUM will 
fetch the correct value of X=2)

But if A is down when you bootstrap D and you have 
-Dcassandra.consistent.rangemovement=false, D may stream data from B, so the 
new replica group will be:

A, D(X=1), B(X=1), C(X=2)

Now, if C becomes down, reads at QUORUM will succeed but return the stale value 
of X=1, so consistency is broken.

If you're continuously running repair, have hinted handoff and read repair 
enabled, the probability of something like this happening will decrease, but it 
may still happen. If this is not a problem you may use option 
-Dcassandra.consistent.rangemovement=false to bootstrap a node when another 
node is down. See CASSANDRA-2434 for more background.
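Paulo's replica walkthrough can be replayed as a toy model. This sketch (an illustration with made-up version numbers, not Cassandra code) shows the stale QUORUM read that results when D streams from B instead of A:

```python
def quorum_read(replicas, down=frozenset()):
    # Read from the first two live replicas and reconcile by keeping
    # the newest version, roughly what a QUORUM read does with RF=3.
    live = [version for node, version in sorted(replicas.items())
            if node not in down]
    assert len(live) >= 2, "not enough live replicas for QUORUM"
    return max(live[:2])

# After D takes over A's range, the row's replicas are D, B and C.
# Version 1 is the old value of X; version 2 is the write that succeeded.
consistent = {"D": 2, "B": 1, "C": 2}    # D streamed from A
inconsistent = {"D": 1, "B": 1, "C": 2}  # A was down; D streamed from B

print(quorum_read(consistent, down={"C"}))    # 2 -> latest write survives
print(quorum_read(inconsistent, down={"C"}))  # 1 -> stale read; quorum broken
```

With consistent range movement, every quorum still contains the newer version; in the inconsistent case, losing C leaves a quorum (D, B) that only holds the stale value.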

2016-03-30 11:14 GMT-03:00 Peddi, Praveen:
Hello all,
We just upgraded to 2.2.4 (from 2.0.9) and we noticed one issue when new nodes 
are added. When we add a new node when no nodes are down in the cluster, 
everything works fine but when we add new node while 1 node is down, I am 
seeing the following error. My understanding was that when auto_bootstrap is enabled, 
bootstrapping process uses QUORUM consistency so it should work when one node 
is down. Is that not correct? Is there a way to add a new node with 
bootstrapping, but not using replace address option? We use auto scaling and 
new node gets added automatically when one node goes down, and since it's all 
scripted I can't use replace address in cassandra-env.sh file as a one-time 
option.

One fallback mechanism we could use is to disable auto bootstrap and let read 
repairs populate the data over time, but it's not ideal. Is this even a good 
alternative to this failure?

ERROR 20:30:45 Exception encountered during startup
java.lang.RuntimeException: A node required to move the data consistently is 
down (/xx.xx.xx.xx). If you wish to move the data from a potentially 
inconsistent replica, restart 

Re: auto_bootstrap when a node is down

2016-03-31 Thread Carlos Alonso
If that's your use case I've developed a quick disk based replacement
procedure.

Basically all it involves is rsyncing the data from the old node to the new
node and bring the new one as if it was the old one (only the IP will
change). Step by step details here:
http://mrcalonso.com/cassandra-instantaneous-in-place-node-replacement/

We've been using it for a while and works nicely and avoids the time,
resources and baby-sitting consumption of streaming data across nodes.

Regards

Carlos Alonso | Software Engineer | @calonso 

On 31 March 2016 at 15:26, Peddi, Praveen  wrote:

> Hi Paulo,
> Thanks a lot for detailed explanation. Our usecase is that, when one node
> goes down, a new node in the same AZ comes up immediately (5 to 10 mins)
> and it is safe to assume that no other nodes in another AZ are down at this
> point of time. So based on your explanation, using
> -Dcassandra.consistent.rangemovement=false seems like the way to go for our
> usecase. I will test it with that option.
>
> Thanks again.
>
> Praveen
>
>
>
> From: Paulo Motta 
> Reply-To: "user@cassandra.apache.org" 
> Date: Wednesday, March 30, 2016 at 10:55 AM
> To: "user@cassandra.apache.org" 
> Subject: Re: auto_bootstrap when a node is down
>
> When you add a node it will take over the range of an existing node, and
> thus it should stream data from it to maintain consistency. If the existing
> node is unavailable, the new node may fetch the data from a different
> replica, which may not have some of the data from the node whose range you
> are taking, which may break consistency.
>
> For example, imagine a ring with nodes A, B and C, RF=3. The row X=1 maps
> to node A and is replicated in nodes B and C, so the initial arrangement
> will be:
>
> A(X=1), B(X=1) and C(X=1)
>
> Node B is down and you write X=2 to A, which replicates the data only to C
> since B is down (and hinted handoff is disabled). The write succeeds at
> QUORUM. The new arrangement becomes:
>
> A(X=2), B(X=1), C(X=2) (if any of the nodes is down, any read at QUORUM
> will fetch the correct value of X=2)
>
> Now imagine you add a new node D between A and B. If D streams data from
> A, the new replica group will become:
>
> A, D(X=2), B(X=1), C(X=2) (if any of the nodes is down, any read at QUORUM
> will fetch the correct value of X=2)
>
> But if A is down when you bootstrap D and you have
> -Dcassandra.consistent.rangemovement=false, D may stream data from B, so
> the new replica group will be:
>
> A, D(X=1), B(X=1), C(X=2)
>
> Now, if C becomes down, reads at QUORUM will succeed but return the stale
> value of X=1, so consistency is broken.
>
> If you're continuously running repair, have hinted handoff and read repair
> enabled, the probability of something like this happening will decrease,
> but it may still happen. If this is not a problem you may use option
> -Dcassandra.consistent.rangemovement=false to bootstrap a node when another
> node is down. See CASSANDRA-2434 for more background.
>
> 2016-03-30 11:14 GMT-03:00 Peddi, Praveen :
>
>> Hello all,
>> We just upgraded to 2.2.4 (from 2.0.9) and we noticed one issue when new
>> nodes are added. When we add a new node when no nodes are down in the
>> cluster, everything works fine but when we add new node while 1 node is
>> down, I am seeing following error. My understanding was when auto_bootstrap
>> is enabled, bootstrapping process uses QUORUM consistency so it should work
>> when one node is down. Is that not correct? Is there a way to add a new
>> node with bootstrapping, but not using replace address option? We use auto
>> scaling and new node gets added automatically when one node goes down and
>> since it's all scripted I can't use replace address in cassandra-env.sh file
>> as a one-time option.
>>
>> One fallback mechanism we could use is to disable auto bootstrap and let
>> read repairs populate the data over time, but it's not ideal. Is this even a
>> good alternative to this failure?
>>
>> ERROR 20:30:45 Exception encountered during startup
>> java.lang.RuntimeException: A node required to move the data consistently
>> is down (/xx.xx.xx.xx). If you wish to move the data from a potentially
>> inconsistent replica, restart the node with
>> -Dcassandra.consistent.rangemovement=false
>>
>> Praveen
>>
>
>


Re: Upgrade cassandra from 2.1.9 to 3.x?

2016-03-31 Thread Paulo Motta
If there isn't anything in NEWS.txt forbidding it, then it *should* be
possible. That is the authoritative source for upgrade information. As you
noted, the only known restriction is that you upgrade from at least 2.1.9,
per the NEWS.txt entry.

But as always, and especially when doing an upgrade between major versions,
make sure to test the upgrade in a staging environment first, and
snapshot/backup your tables before the process.
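As a rough per-node outline of the kind of procedure Paulo describes (the package pinning and service commands are assumptions for a Debian-style install, not from the thread; verify them for your environment):

```shell
# Hypothetical per-node rolling-upgrade outline -- package name, version
# pin and service commands are assumptions; adapt to your install.
nodetool snapshot -t pre-3.x-upgrade   # backup SSTables first
nodetool drain                         # flush memtables, stop accepting writes
sudo service cassandra stop
sudo apt-get install cassandra=3.4     # upgrade the binaries
sudo service cassandra start
nodetool upgradesstables               # rewrite SSTables into the new format
```

Repeat one node at a time, letting each rejoin the ring before moving on.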

2016-03-31 9:14 GMT-03:00 Steven Choo :

> Hi,
>
> Is it possible to update Cassandra from 2.1.9 to 3.x (e.g. 3.4) in one
> step?
>
> The info about the tick-tock release schedule does not say anything
> specific about it and the documentation on http://docs.datastax.com is
> separated by 3.0 and 3.x.
> Even though the info on
> https://github.com/apache/cassandra/blob/trunk/NEWS.txt is clear about
> updating from 2.1.9 to 3.0 and does not say anything specifically about the
> tick-tock versions, I want to be sure it's possible, given the earlier
> upgrade requirements when upgrading between minor versions.
>
> PS. I have seen
> http://www.mail-archive.com/user%40cassandra.apache.org/msg46632.html but
> I'm more interested in whether it's possible at all.
>
> Best regards,
>
>
>
> Steven Choo
>
>
>
>
>
> La VistaDuna
>
> Bijsterveldenlaan 5
>
> 5045 ZZ Tilburg
>
> Netherlands
>
> www.iqnomy.com
>
>
>
> Tel:013-3031160
>
> Email:   ste...@iqnomy.com 
>


Re: auto_bootstrap when a node is down

2016-03-31 Thread Peddi, Praveen
Hi Paulo,
Thanks a lot for detailed explanation. Our usecase is that, when one node goes 
down, a new node in the same AZ comes up immediately (5 to 10 mins) and it is 
safe to assume that no other nodes in another AZ are down at this point of 
time. So based on your explanation, using 
-Dcassandra.consistent.rangemovement=false seems like the way to go for our 
usecase. I will test it with that option.

Thanks again.

Praveen





From: Paulo Motta
Reply-To: "user@cassandra.apache.org"
Date: Wednesday, March 30, 2016 at 10:55 AM
To: "user@cassandra.apache.org"
Subject: Re: auto_bootstrap when a node is down

When you add a node it will take over the range of an existing node, and thus 
it should stream data from it to maintain consistency. If the existing node is 
unavailable, the new node may fetch the data from a different replica, which 
may not have some of the data from the node whose range you are taking, which 
may break consistency.

For example, imagine a ring with nodes A, B and C, RF=3. The row X=1 maps to 
node A and is replicated in nodes B and C, so the initial arrangement will be:

A(X=1), B(X=1) and C(X=1)

Node B is down and you write X=2 to A, which replicates the data only to C 
since B is down (and hinted handoff is disabled). The write succeeds at QUORUM. 
The new arrangement becomes:

A(X=2), B(X=1), C(X=2) (if any of the nodes is down, any read at QUORUM will 
fetch the correct value of X=2)

Now imagine you add a new node D between A and B. If D streams data from A, the 
new replica group will become:

A, D(X=2), B(X=1), C(X=2) (if any of the nodes is down, any read at QUORUM will 
fetch the correct value of X=2)

But if A is down when you bootstrap D and you have 
-Dcassandra.consistent.rangemovement=false, D may stream data from B, so the 
new replica group will be:

A, D(X=1), B(X=1), C(X=2)

Now, if C becomes down, reads at QUORUM will succeed but return the stale value 
of X=1, so consistency is broken.

If you're continuously running repair, have hinted handoff and read repair 
enabled, the probability of something like this happening will decrease, but it 
may still happen. If this is not a problem you may use option 
-Dcassandra.consistent.rangemovement=false to bootstrap a node when another 
node is down. See CASSANDRA-2434 for more background.

2016-03-30 11:14 GMT-03:00 Peddi, Praveen:
Hello all,
We just upgraded to 2.2.4 (from 2.0.9) and we noticed one issue when new nodes 
are added. When we add a new node when no nodes are down in the cluster, 
everything works fine but when we add new node while 1 node is down, I am 
seeing the following error. My understanding was that when auto_bootstrap is enabled, 
bootstrapping process uses QUORUM consistency so it should work when one node 
is down. Is that not correct? Is there a way to add a new node with 
bootstrapping, but not using replace address option? We use auto scaling and 
new node gets added automatically when one node goes down, and since it's all 
scripted I can't use replace address in cassandra-env.sh file as a one-time 
option.

One fallback mechanism we could use is to disable auto bootstrap and let read 
repairs populate the data over time, but it's not ideal. Is this even a good 
alternative to this failure?

ERROR 20:30:45 Exception encountered during startup
java.lang.RuntimeException: A node required to move the data consistently is 
down (/xx.xx.xx.xx). If you wish to move the data from a potentially 
inconsistent replica, restart the node with 
-Dcassandra.consistent.rangemovement=false
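For the scripted auto-scaling case above, one hedged option (the variable name and mechanism are my assumption, not from the thread) is to have cassandra-env.sh read the flag from the environment, so automation can set it only when a peer is known to be down:

```shell
# Hypothetical snippet for conf/cassandra-env.sh: read the flag from the
# environment (variable name is made up) so the provisioning script can
# export CONSISTENT_RANGEMOVEMENT=false only when a peer is down.
JVM_OPTS="$JVM_OPTS -Dcassandra.consistent.rangemovement=${CONSISTENT_RANGEMOVEMENT:-true}"
```

This keeps the safe default (true) while letting automation opt out per bootstrap.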

Praveen



Re: StatusLogger output

2016-03-31 Thread Vasileios Vlachos
Does anyone else have any idea how to interpret StatusLogger output? As Sean said,
this may not help in determining the problem, but it would definitely help
my general understanding.

Thanks,
Bill

On Thu, Mar 24, 2016 at 5:24 PM, Vasileios Vlachos <
vasileiosvlac...@gmail.com> wrote:

> Thanks for your help Sean,
>
> The reason StatusLogger messages appear in the logs is usually, as you
> said, a GC pause (ParNew or CMS, I have seen both), or dropped messages. In
> our case dropped messages are always (so far) due to internal timeouts, not
> due to cross node timeouts (like the sample output in the link I provided
> earlier). I have seen StatusLogger output during low traffic times and I
> cannot say that we seem to have more logs during high-traffic hours.
>
> We use Nagios for monitoring and have several checks for cassandra (we use
> the JMX console for each node). However, most graphs are averaged out. I
> can see some spikes at the times, however, these spikes only go around
> 20-30% of the load we get during high-traffic times. The only time we have
> seen nodes marked down in the logs is when there is some severe cross-DC
> VPN issue, which is not something that happens often and does not correlate
> with StatusLogger output either.
>
> Regarding GC, we only see up to 10 GC pauses per day in the logs (I ofc
> mean over 200ms which is the threshold for logging GC events by default).
> We are actually experimenting with GC these days on one of the nodes, but I
> cannot say this has made things worse/better.
>
> I was hoping that by understanding the StatusLogger output better I'd be
> able to investigate further. We monitor metrics like hints, pending
> tasks, reads/writes per CF, read/write latency/CF, compactions,
> connections/node. If there is anything from the JMX console that you would
> suggest I should be monitoring, please let me know. I was thinking
> compactions may be the reason for this (so, I/O could be the bottleneck),
> but looking at the graphs I can see that when a node compacts its CPU usage
> would only max at around 20-30% and would only add 2-5ms of read/write
> latency per CF (if any).
>
> Thanks,
> Vasilis
>
> On Thu, Mar 24, 2016 at 3:31 PM,  wrote:
>
>> I am not sure the status logger output helps determine the problem.
>> However, the dropped mutations and the status logger output is what I see
>> when there is too high of a load on one or more Cassandra nodes. It could
>> be long GC pauses, something reading too much data (a large row or a
>> multi-partition query), or just too many requests for the number of nodes
>> you have. Are you using OpsCenter to monitor the rings? Do you have read or
>> write spikes at the time? Any GC messages in the log. Any nodes going down
>> at the time?
>>
>>
>>
>>
>>
>> Sean Durity
>>
>>
>>
>> *From:* Vasileios Vlachos [mailto:vasileiosvlac...@gmail.com]
>> *Sent:* Thursday, March 24, 2016 8:13 AM
>> *To:* user@cassandra.apache.org
>> *Subject:* Re: StatusLogger output
>>
>>
>>
>> Just to clarify, I can see line 29 which seems to explain the format
>> (first number is ops, the second is data); however, I don't know what they
>> actually mean.
>>
>>
>>
>> Thanks,
>>
>> Vasilis
>>
>>
>>
>> On Thu, Mar 24, 2016 at 11:45 AM, Vasileios Vlachos <
>> vasileiosvlac...@gmail.com> wrote:
>>
>> Hello,
>>
>>
>>
>> Environment:
>>
>> - Cassandra 2.0.17, 8 nodes, 4 per DC
>>
>> - Ubuntu 12.04, 6-Cores, 16GB of RAM (we use VMWare)
>>
>>
>>
>> Every node seems to be dropping messages (anywhere from 10 to 300) twice
>> a day. I don't know if this has always been the case, but it has definitely
>> been going on for the past month or so. Whenever that happens we get
>> StatusLogger.java output in the log, which is the state of the node at
>> the time it dropped messages. This output contains information
>> similar/identical to nodetool tpstats, but further from that,
>> information regarding system CF follows as can be seen here:
>> http://ur1.ca/ooan6
>>
>>
>>
>> How can we use this information to find out what the problem was? I am
>> specifically referring to the information regarding the system CF. I had a
>> look in the system tables but I cannot draw anything from that. The output
>> in the log seems to contain two values (comma separated). What are these
>> numbers?
>>
>>
>>
>> I wasn't able to find anything on the web/DataStax docs. Any help would
>> be greatly appreciated!
>>
>>
>>
>> Thanks,
>>
>> Vasilis
>>
>>
>>
>> --
>>

Upgrade cassandra from 2.1.9 to 3.x?

2016-03-31 Thread Steven Choo
Hi,

Is it possible to update Cassandra from 2.1.9 to 3.x (e.g. 3.4) in one step?

The info about the tick-tock release schedule does not say anything
specific about it and the documentation on http://docs.datastax.com is
separated by 3.0 and 3.x.
Even though the info on
https://github.com/apache/cassandra/blob/trunk/NEWS.txt is clear about
updating from 2.1.9 to 3.0 and does not say anything specifically about the
tick-tock versions, I want to be sure it's possible, given the earlier
upgrade requirements when upgrading between minor versions.

PS. I have seen
http://www.mail-archive.com/user%40cassandra.apache.org/msg46632.html but
I'm more interested in whether it's possible at all.

Best regards,



Steven Choo





La VistaDuna

Bijsterveldenlaan 5

5045 ZZ Tilburg

Netherlands

www.iqnomy.com



Tel:013-3031160

Email:   ste...@iqnomy.com 


Re: How many nodes do we require

2016-03-31 Thread Alain RODRIGUEZ
@Rakesh:

Are you talking about SimpleStrategy with RF=3
> or NetworkTopologyStrategy with RF=3?


Just always go with NetworkTopologyStrategy; I see no reason not to nowadays,
even on test clusters. If you use SimpleStrategy, all the machines will be
considered as a single datacenter, no matter their location or configuration,
which can lead to a lot of issues.
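The difference shows up directly in the keyspace DDL. As a minimal sketch
(the keyspace and datacenter names "my_ks", "dc1" and "dc2" are made up),
NetworkTopologyStrategy lets you set a replication factor per datacenter,
while SimpleStrategy has a single cluster-wide factor:

```python
# Sketch: build the two kinds of CREATE KEYSPACE statements.
# Keyspace and DC names ("my_ks", "dc1", "dc2") are hypothetical.

def simple_strategy_ddl(keyspace, rf):
    return (f"CREATE KEYSPACE {keyspace} WITH replication = "
            f"{{'class': 'SimpleStrategy', 'replication_factor': {rf}}};")

def network_topology_ddl(keyspace, dc_rf):
    # dc_rf maps datacenter name -> replication factor in that DC
    opts = ", ".join(f"'{dc}': {rf}" for dc, rf in dc_rf.items())
    return (f"CREATE KEYSPACE {keyspace} WITH replication = "
            f"{{'class': 'NetworkTopologyStrategy', {opts}}};")

print(simple_strategy_ddl("my_ks", 3))
print(network_topology_ddl("my_ks", {"dc1": 3, "dc2": 3}))
```

With NetworkTopologyStrategy in place from the start, adding a second DC
later is just a replication change rather than a migration.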

C*heers,
---
Alain Rodriguez - al...@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2016-03-31 14:03 GMT+02:00 Alain RODRIGUEZ :

> Hi,
>
> Because if you lose a node you have chances to lose some data forever if
>> it was not yet replicated.
>
>
> I think I get your point, but keep in mind that CL ONE (or LOCAL_ONE) will
> not prevent the coordinator from sending the data to the 2 other replicas;
> it will just wait for the first ack, but all the nodes are supposed to have
> the write. There are also the hints system, read repairs and repairs. If
> none of these work, the problem is probably bigger than just using CL ONE.
>
> I would advise using QUORUM if high consistency is important to you; use
> RF = 2 to spare some space or to favour low latency over consistency.
> Be careful then: losing consistency means that returned data might change
> depending on the node you hit. If some nodes lose information and entropy
> starts being high, the same query run twice at the same time can return 2
> distinct values depending on where they read from. Generally, using RF = 3
> and CL = LOCAL_QUORUM for both reads and writes is the best (safest?)
> option, but some people happily run with RF = 2 and CL = LOCAL_ONE.
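The arithmetic behind these trade-offs is simple enough to sketch (a quorum
is a strict majority of the replicas):

```python
def quorum(rf):
    # A quorum is a strict majority of the replicas.
    return rf // 2 + 1

def tolerated_failures(rf, replicas_needed):
    # How many replicas of a given row can be down while an
    # operation still succeeds at this consistency level.
    return rf - replicas_needed

for rf in (2, 3, 5):
    q = quorum(rf)
    print(f"RF={rf}: QUORUM needs {q} acks, tolerates "
          f"{tolerated_failures(rf, q)} replica(s) down; "
          f"CL ONE tolerates {tolerated_failures(rf, 1)} down")
```

This is why RF=3 with QUORUM survives one replica down, while RF=2 with
QUORUM (which is 2 of 2) survives none.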
>
> we should be able to run with the loss of one DC and loss of at most one
>> node in the surviving DC
>
>
> You can configure that on the client side with modern clients using the
> native protocol, and there are some consistency considerations there too.
>
> Hope I am correct about all that :-).
>
> C*heers,
> ---
> Alain Rodriguez - al...@thelastpickle.com
> France
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> 2016-03-29 9:56 GMT+02:00 Jacques-Henri Berthemet <
> jacques-henri.berthe...@genesys.com>:
>
>> Because if you lose a node you have chances to lose some data forever if
>> it was not yet replicated.
>>
>>
>>
>> *--*
>>
>> *Jacques-Henri Berthemet*
>>
>>
>>
>> *From:* Jonathan Haddad [mailto:j...@jonhaddad.com]
>> *Sent:* vendredi 25 mars 2016 19:37
>>
>> *To:* user@cassandra.apache.org
>> *Subject:* Re: How many nodes do we require
>>
>>
>>
>> Why would using CL-ONE make your cluster fragile? This isn't obvious to
>> me. It's the most practical setting for high availability, which very much
>> says "not fragile".
>>
>> On Fri, Mar 25, 2016 at 10:44 AM Jacques-Henri Berthemet <
>> jacques-henri.berthe...@genesys.com> wrote:
>>
>> I found this calculator very convenient:
>> http://www.ecyrd.com/cassandracalculator/
>>
>> Regardless of your other DCs you need RF=3 if you write at LOCAL_QUORUM,
>> RF=2 if you write/read at ONE.
>>
>> Obviously using ONE as CL makes your cluster very fragile.
>> --
>> Jacques-Henri Berthemet
>>
>>
>> -Original Message-
>> From: Rakesh Kumar [mailto:rakeshkumar46...@gmail.com]
>> Sent: vendredi 25 mars 2016 18:14
>> To: user@cassandra.apache.org
>> Subject: Re: How many nodes do we require
>>
>> On Fri, Mar 25, 2016 at 11:45 AM, Jack Krupansky
>>  wrote:
>> > It depends on how much data you have. A single node can store a lot of
>> data,
>> > but the more data you have the longer a repair or node replacement will
>> > take. How long can you tolerate for a full repair or node replacement?
>>
>> At this time, for a foreseeable future, size of data will not be
>> significant. So we can safely disregard the above as a decision
>> factor.
>>
>> >
>> > Generally, RF=3 is both sufficient and recommended.
>>
>> Are you talking about SimpleStrategy with RF=3
>> or NetworkTopologyStrategy with RF=3?
>>
>>
>> taken from:
>>
>>
>> https://docs.datastax.com/en/cassandra/2.0/cassandra/architecture/architectureDataDistributeReplication_c.html
>>
>> "
>> Three replicas in each data center: This configuration tolerates
>> either the failure of a one node per replication group at a strong
>> consistency level of LOCAL_QUORUM or multiple node failures per data
>> center using consistency level ONE."
>>
>> In our case, with only 3 nodes in each DC, wouldn't RF=3 effectively
>> mean ALL?
>>
>> I will state our requirement clearly:
>>
>> If we are going with six nodes (3 in each DC), we should be able to
>> write even with a loss of one DC and loss of one node of the surviving
>> DC. I am open to hearing what compromise we have to do with the reads
>> during the time a DC is down. For us write is critical, more than
>> reads.
>>
>> Maybe this is not possible with 6 nodes and requires more. Please advise.
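As a rough sanity check of that 6-node scenario (assuming
NetworkTopologyStrategy with RF=3 per DC and writes at LOCAL_QUORUM against
the surviving DC):

```python
def local_quorum_ok(rf_local, nodes_up_local):
    # LOCAL_QUORUM needs a majority of the local replicas to ack.
    return nodes_up_local >= rf_local // 2 + 1

# One whole DC lost, plus one node lost in the surviving 3-node DC:
rf_per_dc, surviving_nodes = 3, 2
print("writes at LOCAL_QUORUM still succeed:",
      local_quorum_ok(rf_per_dc, surviving_nodes))  # needs 2 of 3
# EACH_QUORUM, by contrast, needs a quorum in *every* DC and fails
# as soon as a full DC is down.
```

So under those assumptions, 6 nodes (3 per DC, RF=3) do satisfy the stated
write requirement, as long as the clients fall back to the surviving DC.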
>>
>>
>


Re: How many nodes do we require

2016-03-31 Thread Alain RODRIGUEZ
Hi,

Because if you lose a node you have chances to lose some data forever if it
> was not yet replicated.


I think I get your point, but keep in mind that CL ONE (or LOCAL_ONE) will
not prevent the coordinator from sending the data to the 2 other replicas;
it will just wait for the first ack, but all the nodes are supposed to have
the write. There are also the hints system, read repairs and repairs. If
none of these work, the problem is probably bigger than just using CL ONE.

I would advise using QUORUM if high consistency is important to you; use
RF = 2 to spare some space or to favour low latency over consistency.
Be careful then: losing consistency means that returned data might change
depending on the node you hit. If some nodes lose information and entropy
starts being high, the same query run twice at the same time can return 2
distinct values depending on where they read from. Generally, using RF = 3
and CL = LOCAL_QUORUM for both reads and writes is the best (safest?)
option, but some people happily run with RF = 2 and CL = LOCAL_ONE.

we should be able to run with the loss of one DC and loss of at most one
> node in the surviving DC


You can configure that on the client side with modern clients using the
native protocol, and there are some consistency considerations there too.

Hope I am correct about all that :-).

C*heers,
---
Alain Rodriguez - al...@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2016-03-29 9:56 GMT+02:00 Jacques-Henri Berthemet <
jacques-henri.berthe...@genesys.com>:

> Because if you lose a node you have chances to lose some data forever if
> it was not yet replicated.
>
>
>
> *--*
>
> *Jacques-Henri Berthemet*
>
>
>
> *From:* Jonathan Haddad [mailto:j...@jonhaddad.com]
> *Sent:* vendredi 25 mars 2016 19:37
>
> *To:* user@cassandra.apache.org
> *Subject:* Re: How many nodes do we require
>
>
>
> Why would using CL-ONE make your cluster fragile? This isn't obvious to
> me. It's the most practical setting for high availability, which very much
> says "not fragile".
>
> On Fri, Mar 25, 2016 at 10:44 AM Jacques-Henri Berthemet <
> jacques-henri.berthe...@genesys.com> wrote:
>
> I found this calculator very convenient:
> http://www.ecyrd.com/cassandracalculator/
>
> Regardless of your other DCs you need RF=3 if you write at LOCAL_QUORUM,
> RF=2 if you write/read at ONE.
>
> Obviously using ONE as CL makes your cluster very fragile.
> --
> Jacques-Henri Berthemet
>
>
> -Original Message-
> From: Rakesh Kumar [mailto:rakeshkumar46...@gmail.com]
> Sent: vendredi 25 mars 2016 18:14
> To: user@cassandra.apache.org
> Subject: Re: How many nodes do we require
>
> On Fri, Mar 25, 2016 at 11:45 AM, Jack Krupansky
>  wrote:
> > It depends on how much data you have. A single node can store a lot of
> data,
> > but the more data you have the longer a repair or node replacement will
> > take. How long can you tolerate for a full repair or node replacement?
>
> At this time, for a foreseeable future, size of data will not be
> significant. So we can safely disregard the above as a decision
> factor.
>
> >
> > Generally, RF=3 is both sufficient and recommended.
>
> Are you talking about SimpleStrategy with RF=3
> or NetworkTopologyStrategy with RF=3?
>
>
> taken from:
>
>
> https://docs.datastax.com/en/cassandra/2.0/cassandra/architecture/architectureDataDistributeReplication_c.html
>
> "
> Three replicas in each data center: This configuration tolerates
> either the failure of a one node per replication group at a strong
> consistency level of LOCAL_QUORUM or multiple node failures per data
> center using consistency level ONE."
>
> In our case, with only 3 nodes in each DC, wouldn't RF=3 effectively
> mean ALL?
>
> I will state our requirement clearly:
>
> If we are going with six nodes (3 in each DC), we should be able to
> write even with a loss of one DC and loss of one node of the surviving
> DC. I am open to hearing what compromise we have to do with the reads
> during the time a DC is down. For us write is critical, more than
> reads.
>
> Maybe this is not possible with 6 nodes and requires more. Please advise.
>
>


Re: Consistency Level (QUORUM vs LOCAL_QUORUM)

2016-03-31 Thread Alain RODRIGUEZ
 Hi,

If you want the full immediate consistency of a traditional relational
> database, then go with CL=ALL, otherwise, take your pick from the many
> degrees of immediacy that Cassandra offers:


My understanding is that using RF 3 and LOCAL_QUORUM for both reads and
writes provides strong consistency and high availability: one node can go
down without lowering the consistency. Or use RF = 5 with a quorum of 3,
allowing 2 nodes down, if you need more availability / redundancy (I never
saw that in use so far). I prefer this to ALL, which makes the cluster
partially unavailable any time a node is down (a crash or just a node
restart). I also never saw anyone using CL = ALL so far, for this exact
reason I believe.

Writing to 2 nodes out of 3 and reading from 2 nodes out of 3, using RF = 3
and CL = LOCAL_QUORUM for reads and writes, will give strong and (almost)
immediate consistency. Remember there is no lock though, so "immediate"
means as soon as the write succeeds on enough nodes (2 in our case).
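The usual way to state this overlap guarantee (a sketch of the textbook
rule, not Cassandra-specific code) is that a read is guaranteed to see the
latest write whenever the read set and write set share at least one replica,
i.e. when R + W > RF:

```python
def strongly_consistent(read_replicas, write_replicas, rf):
    # At least one replica is guaranteed to be in both the read set
    # and the write set when R + W > RF (pigeonhole argument).
    return read_replicas + write_replicas > rf

print(strongly_consistent(2, 2, 3))  # QUORUM reads + QUORUM writes, RF=3
print(strongly_consistent(1, 1, 3))  # CL ONE both ways, RF=3: no overlap
```

QUORUM on both sides always satisfies this, since two majorities of the same
replica set must intersect.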

If you need a strong consistency and are reading the same data in the 2
DCs, consider using EACH_QUORUM as Jack mentioned earlier.

Also, when using 2 datacenters, make sure to pin clients to one DC.
Otherwise you might perform LOCAL_QUORUM operations against either of the 2
DCs, which would remove the strong consistency.

More information about this last point on step 4 there:
https://docs.datastax.com/en/cassandra/2.1/cassandra/operations/ops_add_dc_to_cluster_t.html

About consistency
https://wiki.apache.org/cassandra/ArchitectureOverview#Consistency.

C*heers,
---
Alain Rodriguez - al...@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2016-03-28 5:44 GMT+02:00 Jack Krupansky :

> The third choice is EACH_QUORUM which assures QUORUM in each data center
> (all data centers.)
>
> There is no "immediate consistency" per se in Cassandra. Cassandra offers
> "eventual consistency" and "tunable consistency" or the degree of immediate
> consistency, which is the CL that you specify - you specify the degree of
> immediate consistency that you require. If you want the full immediate
> consistency of a traditional relational database, then go with CL=ALL,
> otherwise, take your pick from the many degrees of immediacy that Cassandra
> offers:
>
> http://docs.datastax.com/en/cassandra/3.x/cassandra/dml/dmlConfigConsistency.html
> .
>
> In short, Cassandra does indeed guarantee the degree of immediate
> consistency that you specify (and presumably want.)
>
>
> -- Jack Krupansky
>
> On Sun, Mar 27, 2016 at 6:36 PM, Harikrishnan A  wrote:
>
>> Hello,
>>
>> I have a question regarding consistency Level settings in a multi Data
>> Center Environment.  What is the preferred CL settings in this scenario for
>> an immediate consistency , QUORUM or LOCAL_QUORUM ?
>>
>> If the replication factor is set to 3 each (2 data centers), QUORUM
>> (writes/reads) waits for 2 consistent responses from nodes across the
>> cluster (out of 6 replicas in total), whereas LOCAL_QUORUM expects 2
>> consistent responses from the 3 replicas within the LOCAL data center. If
>> my understanding is correct, does that mean QUORUM doesn't guarantee
>> immediate consistency (in this multi data center scenario) whereas
>> LOCAL_QUORUM does?
>>
>>
>> Thanks & Regards,
>> Hari
>>
>>
>>
>


Optimizing read and write performance in cassandra through multi-DCs.

2016-03-31 Thread Atul Saroha
Hi,

Would like to understand what's a good approach for handling different
read and write patterns.

We have both read-intensive (like direct traffic from the web) and
write-intensive tasks (continuous bulk/batch upload of data).

For this, we set up two datacenters: one for read and one for write
tasks.

Write traffic -> write schema 'W1' with its replication factor 'W1r' ->
read schema 'R1' with its replication factor 'R1r1'.

Read traffic -> the same read schema 'R1' with its replication factor
'R1r2'.

The R1 schema is the same in both, but the read DC doesn't have the write
schema, and the replication factor for schema R1 differs between the read
and write DCs.

This was done so we can independently scale nodes based on read and write
performance requirements separately.
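For what it's worth, a single keyspace can already carry a different
replication factor in each DC with NetworkTopologyStrategy, which is one way
to get this kind of workload isolation inside one cluster. A hedged sketch
(keyspace and DC names are invented, since the real ones weren't given):

```python
# Hypothetical names; replace with your real keyspace and DC names.
replication = {"write_dc": 3, "read_dc": 2}

opts = ", ".join(f"'{dc}': {rf}" for dc, rf in replication.items())
ddl = (f"ALTER KEYSPACE r1 WITH replication = "
       f"{{'class': 'NetworkTopologyStrategy', {opts}}};")
print(ddl)
```

Each DC then holds its own copies at its own factor, and clients pinned to
each DC with LOCAL_* consistency levels keep the read and write load apart.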



My question is:
Will this provide a concrete performance benefit in Cassandra?
Would making a single big cluster with one DC be a better approach than this?


Note: the bulk upload will write lots of partition keys in a batch, and we
also have Spark read and write jobs configured against the write DC.

-
Atul Saroha
*Sr. Software Engineer*
*M*: +91 8447784271 *T*: +91 124-415-6069 *EXT*: 12369
Plot # 362, ASF Centre - Tower A, Udyog Vihar,
 Phase -4, Sector 18, Gurgaon, Haryana 122016, INDIA


Re: Runtime exception during repair job task

2016-03-31 Thread Carlos Alonso
This is probably due to corrupt data, or a Cassandra upgrade where you
didn't run upgradesstables.

I'd suggest scrubbing the column family (or running upgradesstables on it).

Hope it helps.

Carlos Alonso | Software Engineer | @calonso 

On 31 March 2016 at 12:10,  wrote:

> Hi all,
>
> Recently we tried to repair one of our biggest table, and we keep getting
> hit by errors related to hard link. Here's a stacktrace:
>
> ERROR [RepairJobTask:4] 2016-03-31 05:47:27,268 RepairJob.java:145 - Error
> occurred during snapshot phase
> java.lang.RuntimeException: Could not create snapshot at /10.51.0.7
> at
> org.apache.cassandra.repair.SnapshotTask$SnapshotCallback.onFailure(SnapshotTask.java:77)
> ~[apache-cassandra-2.1.5.jar:2.1.5]
> at
> org.apache.cassandra.net.ResponseVerbHandler.doVerb(ResponseVerbHandler.java:48)
> ~[apache-cassandra-2.1.5.jar:2.1.5]
> at
> org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:62)
> ~[apache-cassandra-2.1.5.jar:2.1.5]
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> [na:1.7.0_80]
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> [na:1.7.0_80]
> at java.lang.Thread.run(Thread.java:745) [na:1.7.0_80]
> ERROR [AntiEntropyStage:39] 2016-03-31 05:47:27,268
> CassandraDaemon.java:223 - Exception in thread
> Thread[AntiEntropyStage:39,5,main]
> java.lang.RuntimeException: java.lang.RuntimeException: Tried to hard link
> to file that does not exist
> /data/db/ks/table-a24af0002ed511e5b983ade99871dd76/ks-table-ka-50582-Statistics.db
> at
> org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:141)
> ~[apache-cassandra-2.1.5.jar:2.1.5]
> at
> org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:62)
> ~[apache-cassandra-2.1.5.jar:2.1.5]
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> ~[na:1.7.0_80]
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> ~[na:1.7.0_80]
> at java.lang.Thread.run(Thread.java:745) ~[na:1.7.0_80]
> Caused by: java.lang.RuntimeException: Tried to hard link to file that
> does not exist
> /data/db/ks/table-a24af0002ed511e5b983ade99871dd76/ks-table-ka-50582-Statistics.db
> at
> org.apache.cassandra.io.util.FileUtils.createHardLink(FileUtils.java:90)
> ~[apache-cassandra-2.1.5.jar:2.1.5]
> at
> org.apache.cassandra.io.sstable.SSTableReader.createLinks(SSTableReader.java:1799)
> ~[apache-cassandra-2.1.5.jar:2.1.5]
> at
> org.apache.cassandra.db.ColumnFamilyStore.snapshotWithoutFlush(ColumnFamilyStore.java:2237)
> ~[apache-cassandra-2.1.5.jar:2.1.5]
> at
> org.apache.cassandra.db.ColumnFamilyStore.snapshot(ColumnFamilyStore.java:2319)
> ~[apache-cassandra-2.1.5.jar:2.1.5]
> at
> org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:82)
> ~[apache-cassandra-2.1.5.jar:2.1.5]
> ... 4 common frames omitted
>
> I tried Googling for that particular error and I did not find a definitive
> answer, instead what seems to be recommended is to simply restart the node.
> However, we're getting this error at least once a day and sometimes on
> multiple nodes (we have 7 nodes currently), so it's getting tedious to
> restart cassandra every time.
>
> I saw the issue https://issues.apache.org/jira/browse/CASSANDRA-6433 and
> it suggests it's due to a drop of a keyspace, but we didn't do any drop. So
> I'm not sure that issue really applies, although the error is related.
>
> This issue https://issues.apache.org/jira/browse/CASSANDRA-6716 reports
> the same exception but we didn't do any scrubbing, so I'm not sure it
> applies either.
>
> We're running cassandra 2.1.5 by the way. I don't know if upgrading will
> fix the problems, because I didn't really see anything related to this
> looking in the changelogs.
>
> I'm wondering if getting these exceptions will somehow "block" the repair,
> because it seems the repair is super slow right now (we're talking days
> repairing).
>


Re: Re: Data export with consistency problem

2016-03-31 Thread Alain RODRIGUEZ
>
> If we remove the network cable of one node, import 30 million rows of
> data into that table, and then reconnect the network cable, we export the
> data immediately and we cannot get all the 30 million rows of data.
> But if we manually run 'kill -9 pid' on one node, import 30 million rows
> of data into that table, and then restart the cassandra of that node, we
> export the data immediately and we can get all the 30 million rows of data.


This is weird; I guess with your configuration, both unplugging the cable
and stopping Cassandra should behave the same way. Yet I still do not
understand what "export" is. What operation is it? Do you use a "COPY"
query, add a node or a datacenter, do you read everything and write it
away, something else? I need to understand what you are trying to do to
understand what can be going wrong.

then we can import data into the C* cluster. Why does this happen when there
> are just two normal nodes and our write CL is ALL?


What is 'import'?

Are you sure whatever import is that on this specific query you are using
ALL?

Could you show us:

- the export command
- the import command
- the nodetool status before import / export
- the schema for the used keyspace
- anything else you might consider useful

This does not ring any bell with me so far and I don't have enough
information to dig for now.

C*heers,
---
Alain Rodriguez - al...@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com


2016-03-26 4:42 GMT+01:00 xutom :

> Thanks for your reply!
> I am so sorry for my poor English.
> My keyspace replication is 3 and the client read and write CL are both
> QUORUM.
> If we remove the network cable of one node, import 30 million rows of
> data into that table, and then reconnect the network cable, we export the
> data immediately and we cannot get all the 30 million rows of data.
> But if we manually run 'kill -9 pid' on one node, import 30 million rows
> of data into that table, and then restart the cassandra of that node, we
> export the data immediately and we can get all the 30 million rows of data.
>
> By the way, we did another test: we installed a C* cluster with 3 nodes,
> turned off 'hinted handoff', set the keyspace replication to 3, and used
> client write and read CL = ALL. Then we manually ran 'kill -9 pid' on one
> node, leaving just two normal nodes, and we could still import data into
> the C* cluster. Why does this happen: with just two normal nodes and a
> write CL of ALL, we can still write data into the C* cluster?
>
> At 2016-03-25 18:26:55, "Alain RODRIGUEZ"  wrote:
>
> Hi Jerry,
>
> It is all a matter of replication server side and consistency level client
> side.
>
>
> The minimal setup to ensure availability and a strong consistency is RF= 3
> and CL = (LOCAL_)QUORUM.
>
> This way, one node can go down, you still can reach the 2 needed nodes to
> validate your reads & writes --> Availability
> And as there are 3 replica and an operation is successful if it is
> successful at least on 2 replica, at least one node will be write to and
> read from, ensuring a strong and immediate consistency (multiple reads will
> always return the same value, no matter where you read).
>
> Were you using those settings?
>
> reconnect the network cable, we export the data immediately and we cannot
>> all the 30 million rows of data
>
>
> Not sure about 'export'  and 'we cannot all the 30 million rows'. But I
> imagine you were expecting to read the 30 million rows and did not.
>
> Hinted handoff is an optimisation (anything you can disable is an
> optimisation); you can't rely on an optimisation like hinted handoff.
>
> Let me know if this answer works for you before digging any further.
>
> Also, I removed "d...@cassandra.apache.org" as this mailing list is used
> by the developers to discuss possible issues there is no issue spotted so
> far, just us trying to understand things, let's not bother those guys
> unless we find an issue :-).
>
> C*heers,
> ---
> Alain Rodriguez - al...@thelastpickle.com
> France
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> 2016-03-25 2:35 GMT+01:00 xutom :
>
>> Hi all,
>> I have a C* cluster with five nodes; my cassandra version is 2.1.1
>> and we also enable "Hinted Handoff". Everything is fine while we use the
>> C* cluster to store up to 10 billion rows of data. But now we have a
>> problem. During our test, after we import up to 40 billion rows of data
>> into the C* cluster, we manually remove the network cable of one node
>> (e.g. there are 5 nodes, and we remove just one network cable of a node
>> to simulate a minor network problem with the C* cluster), then we create
>> another table and import 30 million rows into this table. Before we
>> reconnect the network cable of that node, we export the data of the new
>> table; we can export all 30 million rows many times. But after we 

Runtime exception during repair job task

2016-03-31 Thread me
Hi all,
 
Recently we tried to repair one of our biggest table, and we keep
getting hit by errors related to hard link. Here's a stacktrace:
 
ERROR [RepairJobTask:4] 2016-03-31 05:47:27,268 RepairJob.java:145 -
Error occurred during snapshot phase
java.lang.RuntimeException: Could not create snapshot at /10.51.0.7
at org.apache.cassandra.repair.SnapshotTask$SnapshotCallback.onFailure(SnapshotTask.java:77) ~[apache-cassandra-2.1.5.jar:2.1.5]
at org.apache.cassandra.net.ResponseVerbHandler.doVerb(ResponseVerbHandler.java:48) ~[apache-cassandra-2.1.5.jar:2.1.5]
at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:62) ~[apache-cassandra-2.1.5.jar:2.1.5]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_80]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_80]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_80]
ERROR [AntiEntropyStage:39] 2016-03-31 05:47:27,268
CassandraDaemon.java:223 - Exception in thread
Thread[AntiEntropyStage:39,5,main]
java.lang.RuntimeException: java.lang.RuntimeException: Tried to hard
link to file that does not exist
/data/db/ks/table-a24af0002ed511e5b983ade99871dd76/ks-table-ka-50582-Statistics.db
at org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:141) ~[apache-cassandra-2.1.5.jar:2.1.5]
at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:62) ~[apache-cassandra-2.1.5.jar:2.1.5]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[na:1.7.0_80]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) ~[na:1.7.0_80]
at java.lang.Thread.run(Thread.java:745) ~[na:1.7.0_80]
Caused by: java.lang.RuntimeException: Tried to hard link to file that
does not exist
/data/db/ks/table-a24af0002ed511e5b983ade99871dd76/ks-table-ka-50582-Statistics.db
at org.apache.cassandra.io.util.FileUtils.createHardLink(FileUtils.java:90) ~[apache-cassandra-2.1.5.jar:2.1.5]
at org.apache.cassandra.io.sstable.SSTableReader.createLinks(SSTableReader.java:1799) ~[apache-cassandra-2.1.5.jar:2.1.5]
at org.apache.cassandra.db.ColumnFamilyStore.snapshotWithoutFlush(ColumnFamilyStore.java:2237) ~[apache-cassandra-2.1.5.jar:2.1.5]
at org.apache.cassandra.db.ColumnFamilyStore.snapshot(ColumnFamilyStore.java:2319) ~[apache-cassandra-2.1.5.jar:2.1.5]
at org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:82) ~[apache-cassandra-2.1.5.jar:2.1.5]
... 4 common frames omitted
 
I tried Googling for that particular error and I did not find a
definitive answer, instead what seems to be recommended is to simply
restart the node. However, we're getting this error at least once a day
and sometimes on multiple nodes (we have 7 nodes currently), so it's
getting tedious to restart cassandra every time.
 
I saw the issue https://issues.apache.org/jira/browse/CASSANDRA-6433
and it suggests it's due to a drop of a keyspace, but we didn't do
any drop. So I'm not sure that issue really applies, although the
error is related.
 
This issue https://issues.apache.org/jira/browse/CASSANDRA-6716 reports
the same exception but we didn't do any scrubbing, so I'm not sure it
applies either.
 
We're running cassandra 2.1.5 by the way. I don't know if upgrading will
fix the problems, because I didn't really see anything related to this
looking in the changelogs.
 
I'm wondering if getting these exceptions will somehow "block" the
repair, because it seems the repair is super slow right now (we're
talking days repairing).