Re: Question about compaction strategy changes

2016-10-21 Thread Zhao Yang
Hi Edwards,

When changing gc_grace_seconds, no compaction will be triggered.

Regards,
jasonstack




Sent from my Mi phone

On Oct 22, 2016, at 11:37 AM, Seth Edwards wrote:

> Hello! We're using TWCS and we notice that if we make changes to the
> options to the window unit or size, it seems to implicitly start
> recompacting all sstables. Is this indeed the case and more importantly,
> does the same happen if we were to adjust the gc_grace_seconds for this
> table?
>
> Thanks!
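
For reference, a rough sketch of the two kinds of change discussed above
(the keyspace and table names are hypothetical, and the class name assumes
the built-in TWCS shipped in Cassandra 3.0.8+/3.8+):

-- Changing the TWCS window settings (the change the question is about):
ALTER TABLE my_keyspace.events
  WITH compaction = {'class': 'TimeWindowCompactionStrategy',
                     'compaction_window_unit': 'DAYS',
                     'compaction_window_size': 1};

-- Changing only gc_grace_seconds (per the reply above, this does not
-- trigger any compaction):
ALTER TABLE my_keyspace.events WITH gc_grace_seconds = 864000;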


Question about compaction strategy changes

2016-10-21 Thread Seth Edwards
Hello! We're using TWCS and we notice that if we make changes to the
options to the window unit or size, it seems to implicitly start
recompacting all sstables. Is this indeed the case and more importantly,
does the same happen if we were to adjust the gc_grace_seconds for this
table?


Thanks!


Re: Cannot restrict clustering columns by IN relations when a collection is selected by the query

2016-10-21 Thread DuyHai Doan
So the commit on this restriction dates back to 2.2.0 (CASSANDRA-7981).

Maybe Benjamin Lerer can shed some light on it.

On Fri, Oct 21, 2016 at 11:05 PM, Jeff Carpenter <
jeff.carpen...@choicehotels.com> wrote:

> Hello
>
> Consider the following schema:
>
> CREATE TABLE rates_by_code (
>   hotel_id text,
>   rate_code text,
>   rates set,
>   description text,
>   PRIMARY KEY ((hotel_id), rate_code)
> );
>
> When executing the query:
>
> select rates from rates_by_code where hotel_id='AZ123' and rate_code IN
> ('ABC', 'DEF', 'GHI');
>
> I receive the response message:
>
> Cannot restrict clustering columns by IN relations when a collection is
> selected by the query.
>
> If I select a non-collection column such as "description", no error occurs.
>
> Why does this restriction exist? Is this a restriction that is still
> necessary given the new storage engine? (I have verified this on both 2.2.5
> and 3.0.9.)
>
> I looked for a Jira issue related to this topic, but nothing obvious
> popped up. I'd be happy to create one, though.
>
> Thanks
> Jeff Carpenter
>
>
>
>


Cannot restrict clustering columns by IN relations when a collection is selected by the query

2016-10-21 Thread Jeff Carpenter
Hello

Consider the following schema:

CREATE TABLE rates_by_code (
  hotel_id text,
  rate_code text,
  rates set,
  description text,
  PRIMARY KEY ((hotel_id), rate_code)
);

When executing the query:

select rates from rates_by_code where hotel_id='AZ123' and rate_code IN ('ABC', 
'DEF', 'GHI');

I receive the response message:

Cannot restrict clustering columns by IN relations when a collection is 
selected by the query.

If I select a non-collection column such as "description", no error occurs.

Why does this restriction exist? Is this a restriction that is still necessary 
given the new storage engine? (I have verified this on both 2.2.5 and 3.0.9.)

I looked for a Jira issue related to this topic, but nothing obvious popped up. 
I'd be happy to create one, though.

Thanks
Jeff Carpenter
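
One way around the restriction, sketched here, is to issue one equality
query per clustering value instead of the IN and merge the results
client-side (the set's element type was stripped from the schema above):

select rates from rates_by_code where hotel_id='AZ123' and rate_code='ABC';
select rates from rates_by_code where hotel_id='AZ123' and rate_code='DEF';
select rates from rates_by_code where hotel_id='AZ123' and rate_code='GHI';

A single-value equality restriction on the clustering column should not hit
the error; only the IN form is rejected when a collection is selected.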




Re: Does anyone store larger values in Cassandra E.g. 500 KB?

2016-10-21 Thread jason zhao yang
1. Usually the object is serialized before storing, so we know its size at
that point.
2. Add a "chunk id" as the last clustering key.
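
A minimal sketch of the layout point 2 describes (the table name, column
names, and chunking choices are hypothetical):

CREATE TABLE blobs_by_id (
  object_id  text,
  chunk_id   int,           -- last clustering key, as suggested above
  total_size bigint static, -- known up front because the object is serialized first
  data       blob,          -- one chunk of the serialized object
  PRIMARY KEY ((object_id), chunk_id)
);

-- Reading (or re-writing) all chunks of one object touches a single
-- partition, and the chunks come back in order:
SELECT chunk_id, data FROM blobs_by_id WHERE object_id = 'some-object';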

On Fri, Oct 21, 2016 at 11:46 PM, Vikas Jaiman wrote:

> Thanks for your answer but I am just curious about:
>
> i) How do you identify the size of the object which you are going to chunk?
>
> ii) While reading or updating, how does it read all those chunks?
>
> Vikas
>
> On Thu, Oct 20, 2016 at 9:25 PM, Justin Cameron 
> wrote:
>
> You can, but it is not really very efficient or cost-effective. You may
> encounter issues with streaming, repairs and compaction if you have very
> large blobs (100MB+), so try to keep them under 10MB if possible.
>
> I'd suggest storing blobs in something like Amazon S3 and keeping just the
> bucket name & blob id in Cassandra.
>
> On Thu, 20 Oct 2016 at 12:03 Vikas Jaiman 
> wrote:
>
> Hi,
>
> Normally people would like to store smaller values in Cassandra. Is there
> anyone using it to store larger values (e.g. 500KB or more), and if so,
> what issues are you facing? I would also like to know about any tweaks
> you are considering.
>
> Thanks,
> Vikas
>
> --
>
> Justin Cameron
>
> Senior Software Engineer | Instaclustr
>
>
>
>
>
>
>
>
> --
>


Re: Does anyone store larger values in Cassandra E.g. 500 KB?

2016-10-21 Thread Vikas Jaiman
Thanks for your answer but I am just curious about:

i) How do you identify the size of the object which you are going to chunk?

ii) While reading or updating, how does it read all those chunks?

Vikas

On Thu, Oct 20, 2016 at 9:25 PM, Justin Cameron 
wrote:

> You can, but it is not really very efficient or cost-effective. You may
> encounter issues with streaming, repairs and compaction if you have very
> large blobs (100MB+), so try to keep them under 10MB if possible.
>
> I'd suggest storing blobs in something like Amazon S3 and keeping just the
> bucket name & blob id in Cassandra.
>
> On Thu, 20 Oct 2016 at 12:03 Vikas Jaiman 
> wrote:
>
>> Hi,
>>
>> Normally people would like to store smaller values in Cassandra. Is there
>> anyone using it to store larger values (e.g. 500KB or more), and if so,
>> what issues are you facing? I would also like to know about any tweaks
>> you are considering.
>>
>> Thanks,
>> Vikas
>>
> --
>
> Justin Cameron
>
> Senior Software Engineer | Instaclustr
>
>
>
>
>
>


--


Re: Cluster Maintenance Mishap

2016-10-21 Thread Branton Davis
Thanks.  Unfortunately, we lost our system logs during all of this
(had normal logs, but not system) due to an unrelated issue :/

Anyhow, as far as I can tell, we're doing okay.

On Thu, Oct 20, 2016 at 11:18 PM, Jeremiah D Jordan <
jeremiah.jor...@gmail.com> wrote:

> The easiest way to figure out what happened is to examine the system log.
> It will tell you what happened.  But I’m pretty sure your nodes got new
> tokens during that time.
>
> If you want to get back the data inserted during the 2 hours you could use
> sstableloader to send all the data from the 
> /var/data/cassandra_new/cassandra/*
> folders back into the cluster if you still have it.
>
> -Jeremiah
>
>
>
> On Oct 20, 2016, at 3:58 PM, Branton Davis 
> wrote:
>
> Howdy folks.  I asked some about this in IRC yesterday, but we're looking
> to hopefully confirm a couple of things for our sanity.
>
> Yesterday, I was performing an operation on a 21-node cluster (vnodes,
> replication factor 3, NetworkTopologyStrategy, and the nodes are balanced
> across 3 AZs on AWS EC2).  The plan was to swap each node's existing 1TB
> volume (where all cassandra data, including the commitlog, is stored) with
> a 2TB volume.  The plan for each node (one at a time) was basically:
>
>- rsync while the node is live (repeated until there were only minor
>differences from new data)
>- stop cassandra on the node
>- rsync again
>- replace the old volume with the new
>- start cassandra
>
> However, there was a bug in the rsync command.  Instead of copying the
> contents of /var/data/cassandra to /var/data/cassandra_new, it copied it to
> /var/data/cassandra_new/cassandra.  So, when cassandra was started after
> the volume swap, there was some behavior that was similar to bootstrapping
> a new node (data started streaming in from other nodes).  But there
> was also some behavior that was similar to a node replacement (nodetool
> status showed the same IP address, but a different host ID).  This
> happened with 3 nodes (one from each AZ).  The nodes had received 1.4GB,
> 1.2GB, and 0.6GB of data (whereas the normal load for a node is around
> 500-600GB).
>
> The cluster was in this state for about 2 hours, at which point cassandra
> was stopped on them.  Later, I moved the data from the original volumes
> back into place (so, should be the original state before the operation) and
> started cassandra back up.
>
> Finally, the questions.  We've accepted the potential loss of new data
> within the two hours, but our primary concern now is what was happening
> with the bootstrapping nodes.  Would they have taken on the token ranges
> of the original nodes or acted like new nodes and got new token ranges?  If
> the latter, is it possible that any data moved from the healthy nodes to
> the "new" nodes or would restarting them with the original data (and
> repairing) put the cluster's token ranges back into a normal state?
>
> Hopefully that was all clear.  Thanks in advance for any info!
>
>
>


Re: Cluster Maintenance Mishap

2016-10-21 Thread Branton Davis
It mostly seems so.  The thing that bugs me is that some things acted as if
the nodes weren't joining as normal new nodes.  For example, I forgot to
mention until I read your comment that the instances showed as UN
(up, normal) instead of UJ (up, joining) while they were
apparently bootstrapping.

Thanks for the assurance.  I'm thinking (hoping) that we're good.

On Thu, Oct 20, 2016 at 11:24 PM, kurt Greaves  wrote:

>
> On 20 October 2016 at 20:58, Branton Davis 
> wrote:
>
>> Would they have taken on the token ranges of the original nodes or acted
>> like new nodes and got new token ranges?  If the latter, is it possible
>> that any data moved from the healthy nodes to the "new" nodes or
>> would restarting them with the original data (and repairing) put
>> the cluster's token ranges back into a normal state?
>
>
> It sounds like you stopped them before they completed joining. So you
> should have nothing to worry about. If not, you will see them marked as DN
> from other nodes in the cluster. If you did, they wouldn't have assumed the
> token ranges and you shouldn't have any issues.
>
> You can just copy the original data back (including system tables) and
> they should assume their own ranges again, and then you can repair to fix
> any missing replicas.
>
> Kurt Greaves
> k...@instaclustr.com
> www.instaclustr.com
>


Re: Is SASI index in Cassandra efficient for high cardinality columns?

2016-10-21 Thread Kant Kodali
Why can't a secondary index be broken down into token ranges like the primary
index, at least for exact matches? That way we don't need to scan the whole
cluster, at least for exact matches. I understand that for a substring search
there would be 2^n substrings, which equates to 2^n hashes/tokens, which can
be a lot!

On Sat, Oct 15, 2016 at 4:35 AM, DuyHai Doan  wrote:

> If each indexed value has very few matching rows, then querying using SASI
> (or any impl of secondary index) may scan the whole cluster.
>
> This is because the index is "distributed", i.e. the indexed values stay
> on the same nodes as the base data. And even SASI, with its own
> data structures, will not help much here.
>
> One should understand that the 2nd index query has to deal with 2 layers:
>
> 1) The cluster layer, which is common for any impl of 2nd index. Read my
> blog post here: http://www.planetcassandra.org/blog/
> cassandra-native-secondary-index-deep-dive/
>
> 2) The local read path, which depends on the impl of 2nd index. Some use
> the Lucene library, like the Stratio impl; some roll their own data
> structures, like SASI
>
> If you have a 1-to-1 relationship between the index value and the matching
> row (or 1-to-a few), I would recommend using materialized views instead:
>
> http://www.slideshare.net/doanduyhai/sasi-cassandra-on-
> the-full-text-search-ride-voxxed-daybelgrade-2016/25
>
> Materialized views guarantee that for each searched index value, you only
> hit a single node (or N replicas, depending on the consistency level used)
>
> However, materialized views have their own drawbacks (weaker consistency
> guarantees) and you can't use range queries (<, >, ≤, ≥) or full text
> search on the indexed value
>
>
>
>
>
> On Sat, Oct 15, 2016 at 11:55 AM, Kant Kodali  wrote:
>
>> Well, I went with the definition from Wikipedia, and that definition rules
>> out #1, so it is #2, and it is just one matching row in my case.
>>
>>
>>
>> On Sat, Oct 15, 2016 at 2:40 AM, DuyHai Doan 
>> wrote:
>>
>> > Define precisely what you mean by "high cardinality columns". Do you
>> mean:
>> >
>> > 1) a single indexed value is present in a lot of rows
>> > 2) a single indexed value has only a few (if not just one) matching row
>> >
>> >
>> > On Sat, Oct 15, 2016 at 8:37 AM, Kant Kodali  wrote:
>> >
>> >> I understand Secondary Indexes in general are inefficient on high
>> >> cardinality columns but since SASI is built from scratch I wonder if
>> the
>> >> same argument applies there? If not, Why? Because I believe primary
>> keys in
>> >> Cassandra are indeed indexed and since Primary key is supposed to be
>> the
>> >> column with highest cardinality why not do the same for secondary
>> indexes?
>> >>
>> >
>> >
>>
>
>
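
For reference, a minimal sketch of the materialized-view approach DuyHai
describes above (table and column names are hypothetical; materialized
views require Cassandra 3.0+):

CREATE TABLE users (
  id    uuid PRIMARY KEY,
  email text
);

-- An exact-match lookup by email then hits a single partition (and its
-- replicas) rather than fanning out to every node:
CREATE MATERIALIZED VIEW users_by_email AS
  SELECT email, id FROM users
  WHERE email IS NOT NULL AND id IS NOT NULL
  PRIMARY KEY (email, id);

SELECT id FROM users_by_email WHERE email = 'alice@example.com';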


Re: failure node rejoin

2016-10-21 Thread Ben Slater
Just to confirm, are you saying:
a) after operation 2, you select all and get 1000 rows
b) after operation 3 (which only does updates and reads) you select and only
get 953 rows?

If so, that would be very unexpected. If you run your tests without killing
nodes do you get the expected (1,000) rows?

Cheers
Ben

On Fri, 21 Oct 2016 at 17:00 Yuji Ito  wrote:

> > Are you certain your tests don’t generate any overlapping inserts (by
> PK)?
>
> Yes. The operation 2) also checks the number of rows just after all
> insertions.
>
>
> On Fri, Oct 21, 2016 at 2:51 PM, Ben Slater 
> wrote:
>
> OK. Are you certain your tests don’t generate any overlapping inserts (by
> PK)? Cassandra basically treats any inserts with the same primary key as
> updates (so 1000 insert operations may not necessarily result in 1000 rows
> in the DB).
>
> On Fri, 21 Oct 2016 at 16:30 Yuji Ito  wrote:
>
> thanks Ben,
>
> > 1) At what stage did you have (or expect to have) 1000 rows (and have
> the mismatch between actual and expected) - at the end of operation (2) or
> after operation (3)?
>
> after operation 3), at operation 4) which reads all rows by cqlsh with
> CL.SERIAL
>
> > 2) What replication factor and replication strategy is used by the test
> keyspace? What consistency level is used by your operations?
>
> - create keyspace testkeyspace WITH REPLICATION =
> {'class':'SimpleStrategy','replication_factor':3};
> - consistency level is SERIAL
>
>
> On Fri, Oct 21, 2016 at 12:04 PM, Ben Slater 
> wrote:
>
>
> A couple of questions:
> 1) At what stage did you have (or expect to have) 1000 rows (and have the
> mismatch between actual and expected) - at the end of operation (2) or
> after operation (3)?
> 2) What replication factor and replication strategy is used by the test
> keyspace? What consistency level is used by your operations?
>
>
> Cheers
> Ben
>
> On Fri, 21 Oct 2016 at 13:57 Yuji Ito  wrote:
>
> Thanks Ben,
>
> I tried to run a rebuild and repair after the failed node rejoined the
> cluster as a "new" node with -Dcassandra.replace_address_first_boot.
> The failed node could rejoin and I could read all rows successfully.
> (Sometimes a repair failed because the node could not access another node.
> If that happened, I retried the repair.)
>
> But some rows were lost after my destructive test had been repeated (for
> about 5-6 hours).
> After the test inserted 1000 rows, there were only 953 rows at the end of
> the test.
>
> My destructive test:
> - each C* node is killed & restarted at random intervals (within about
> 5 min) throughout this test
> 1) truncate all tables
> 2) insert initial rows (check if all rows are inserted successfully)
> 3) request a lot of reads/writes to random rows for about 30 min
> 4) check all rows
> If operation 1), 2) or 4) fails due to a C* failure, the test retries the
> operation.
>
> Does anyone have a similar problem?
> What causes the data loss?
> Does the test need any extra operation when a C* node is restarted?
> (Currently, I just restart the C* process.)
>
> Regards,
>
>
> On Tue, Oct 18, 2016 at 2:18 PM, Ben Slater 
> wrote:
>
> OK, that’s a bit more unexpected (to me at least) but I think the solution
> of running a rebuild or repair still applies.
>
> On Tue, 18 Oct 2016 at 15:45 Yuji Ito  wrote:
>
> Thanks Ben, Jeff
>
> Sorry that my explanation confused you.
>
> Only node1 is the seed node.
> Node2 whose C* data is deleted is NOT a seed.
>
> I restarted the failed node (node2) after restarting the seed node (node1).
> Restarting node2 succeeded without the exception.
> (As expected, I couldn't restart node2 before restarting node1.)
>
> Regards,
>
>
> On Tue, Oct 18, 2016 at 1:06 PM, Jeff Jirsa 
> wrote:
>
> The unstated "problem" here is that node1 is a seed, which implies
> auto_bootstrap=false (can't bootstrap a seed, so it was almost certainly
> setup to start without bootstrapping).
>
> That means once the data dir is wiped, it's going to start again without a
> bootstrap, and make a single node cluster or join an existing cluster if
> the seed list is valid
>
>
>
> --
> Jeff Jirsa
>
>
> On Oct 17, 2016, at 8:51 PM, Ben Slater 
> wrote:
>
> OK, sorry - I think understand what you are asking now.
>
> However, I’m still a little confused by your description. I think your
> scenario is:
> 1) Stop C* on all nodes in a cluster (Nodes A,B,C)
> 2) Delete all data from Node A
> 3) Restart Node A
> 4) Restart Node B,C
>
> Is this correct?
>
> If so, this isn’t a scenario I’ve tested/seen but I’m not surprised Node A
> starts successfully as there are no running nodes to tell it via gossip that
> it shouldn’t start up without the “replaces” flag.
>
> I think the right way to recover in this scenario is to run a nodetool
> rebuild on Node A after the other two nodes are running. You could
> 

Re: failure node rejoin

2016-10-21 Thread Yuji Ito
> Are you certain your tests don’t generate any overlapping inserts (by PK)?

Yes. The operation 2) also checks the number of rows just after all
insertions.
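
A sketch of how such a row-count check can be run in cqlsh with SERIAL
consistency (the table name is hypothetical):

CONSISTENCY SERIAL;
SELECT count(*) FROM testkeyspace.testtable;  -- expected 1000 after the initial inserts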


On Fri, Oct 21, 2016 at 2:51 PM, Ben Slater 
wrote:

> OK. Are you certain your tests don’t generate any overlapping inserts (by
> PK)? Cassandra basically treats any inserts with the same primary key as
> updates (so 1000 insert operations may not necessarily result in 1000 rows
> in the DB).
>
> On Fri, 21 Oct 2016 at 16:30 Yuji Ito  wrote:
>
>> thanks Ben,
>>
>> > 1) At what stage did you have (or expect to have) 1000 rows (and have
>> the mismatch between actual and expected) - at the end of operation (2) or
>> after operation (3)?
>>
>> after operation 3), at operation 4) which reads all rows by cqlsh with
>> CL.SERIAL
>>
>> > 2) What replication factor and replication strategy is used by the test
>> keyspace? What consistency level is used by your operations?
>>
>> - create keyspace testkeyspace WITH REPLICATION =
>> {'class':'SimpleStrategy','replication_factor':3};
>> - consistency level is SERIAL
>>
>>
>> On Fri, Oct 21, 2016 at 12:04 PM, Ben Slater 
>> wrote:
>>
>>
>> A couple of questions:
>> 1) At what stage did you have (or expect to have) 1000 rows (and have the
>> mismatch between actual and expected) - at the end of operation (2) or
>> after operation (3)?
>> 2) What replication factor and replication strategy is used by the test
>> keyspace? What consistency level is used by your operations?
>>
>>
>> Cheers
>> Ben
>>
>> On Fri, 21 Oct 2016 at 13:57 Yuji Ito  wrote:
>>
>> Thanks Ben,
>>
>> I tried to run a rebuild and repair after the failed node rejoined the
>> cluster as a "new" node with -Dcassandra.replace_address_first_boot.
>> The failed node could rejoin and I could read all rows successfully.
>> (Sometimes a repair failed because the node could not access another node.
>> If that happened, I retried the repair.)
>>
>> But some rows were lost after my destructive test had been repeated (for
>> about 5-6 hours).
>> After the test inserted 1000 rows, there were only 953 rows at the end of
>> the test.
>>
>> My destructive test:
>> - each C* node is killed & restarted at random intervals (within about
>> 5 min) throughout this test
>> 1) truncate all tables
>> 2) insert initial rows (check if all rows are inserted successfully)
>> 3) request a lot of reads/writes to random rows for about 30 min
>> 4) check all rows
>> If operation 1), 2) or 4) fails due to a C* failure, the test retries the
>> operation.
>>
>> Does anyone have a similar problem?
>> What causes the data loss?
>> Does the test need any extra operation when a C* node is restarted?
>> (Currently, I just restart the C* process.)
>>
>> Regards,
>>
>>
>> On Tue, Oct 18, 2016 at 2:18 PM, Ben Slater 
>> wrote:
>>
>> OK, that’s a bit more unexpected (to me at least) but I think the
>> solution of running a rebuild or repair still applies.
>>
>> On Tue, 18 Oct 2016 at 15:45 Yuji Ito  wrote:
>>
>> Thanks Ben, Jeff
>>
>> Sorry that my explanation confused you.
>>
>> Only node1 is the seed node.
>> Node2 whose C* data is deleted is NOT a seed.
>>
>> I restarted the failed node (node2) after restarting the seed node (node1).
>> Restarting node2 succeeded without the exception.
>> (As expected, I couldn't restart node2 before restarting node1.)
>>
>> Regards,
>>
>>
>> On Tue, Oct 18, 2016 at 1:06 PM, Jeff Jirsa 
>> wrote:
>>
>> The unstated "problem" here is that node1 is a seed, which implies
>> auto_bootstrap=false (can't bootstrap a seed, so it was almost certainly
>> setup to start without bootstrapping).
>>
>> That means once the data dir is wiped, it's going to start again without
>> a bootstrap, and make a single node cluster or join an existing cluster if
>> the seed list is valid
>>
>>
>>
>> --
>> Jeff Jirsa
>>
>>
>> On Oct 17, 2016, at 8:51 PM, Ben Slater 
>> wrote:
>>
>> OK, sorry - I think understand what you are asking now.
>>
>> However, I’m still a little confused by your description. I think your
>> scenario is:
>> 1) Stop C* on all nodes in a cluster (Nodes A,B,C)
>> 2) Delete all data from Node A
>> 3) Restart Node A
>> 4) Restart Node B,C
>>
>> Is this correct?
>>
>> If so, this isn’t a scenario I’ve tested/seen but I’m not surprised Node
>> A starts successfully as there are no running nodes to tell it via gossip
>> that it shouldn’t start up without the “replaces” flag.
>>
>> I think the right way to recover in this scenario is to run a nodetool
>> rebuild on Node A after the other two nodes are running. You could
>> theoretically also run a repair (which would be good practice after a weird
>> failure scenario like this) but rebuild will probably be quicker given you
>> know all the data needs to be re-streamed.
>>
>> Cheers
>> Ben
>>
>> On Tue, 18 Oct 2016 at 14:03 Yuji Ito