Re: Inconsistent count(*) and distinct results from Cassandra

2015-03-12 Thread DuyHai Doan
First idea to eliminate any issue with regard to stale data: issue the
same count query with CL=QUORUM and check whether there are still
inconsistencies
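
A minimal cqlsh sketch of that check (assuming the tbl table from the
original post; the CONSISTENCY command applies to the statements that follow
it in the session):

  CONSISTENCY QUORUM;
  SELECT count(*) FROM tbl;  -- repeat a few times and compare the counts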

On Tue, Mar 10, 2015 at 9:13 AM, Rumph, Frens Jan m...@frensjan.nl wrote:

 Hi Jens, Mikhail, Daemeon,

 Thanks for your replies. Sorry for my reply being late ... mails from the
 user-list were moved to the wrong inbox on my side.

 I'm in a development environment and thus using replication factor = 1 and
 consistency = ONE with three nodes. So the 'results from different nodes
 between queries' hypothesis seems unlikely to me. I would expect a timeout
 if some node wouldn't be able to answer.

 I tried tracing, but I couldn't really make anything of it.

 For example I performed two select distinct ... from ... queries: Traces
 for both of them contained more than one line like 'Submitting range
 requests on ... ranges ...' and 'Submitted ... concurrent range requests
 covering ... ranges'. These lines occur with varying numbers, e.g. :

 Submitting range requests on 593 ranges with a concurrency of 75 (1.35
 rows per range expected)
 Submitting range requests on 769 ranges with a concurrency of 75 (1.35
 rows per range expected)


 Also when looking at the lines like 'Executing seq scan across ...
 sstables for ...', I saw that in one case, which yielded far fewer
 partition keys, only the tokens from -922337203685477  to
 -594461978511041000 were included. In a case which yielded many more
 partition keys, the entire token range did seem to be queried.
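
 For anyone wanting to reproduce these traces, session tracing in cqlsh is
 enough (column names taken from the schema quoted below):

  TRACING ON;
  SELECT DISTINCT id, bucket FROM tbl;  -- trace output prints after the result set
  TRACING OFF;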

 To reiterate my initial questions: is this behavior to be expected? Am I
 doing something wrong? Is there a workaround?

 Best regards,
 Frens Jan

 On 4 March 2015 at 22:59, daemeon reiydelle daeme...@gmail.com wrote:

 What is the replication? Could you be serving stale data from a node that
 was not properly replicated (hint timeout exceeded because a node was down)?



 On Wed, Mar 4, 2015 at 11:03 AM, Jens Rantil jens.ran...@tink.se wrote:

 Frens,

 What consistency are you querying with? Could be you are simply
 receiving results from different nodes each time.

 Jens

 –
 Sent from Mailbox https://www.dropbox.com/mailbox


 On Wed, Mar 4, 2015 at 7:08 PM, Mikhail Strebkov streb...@gmail.com
 wrote:

 We have observed the same issue in our production Cassandra cluster (5
 nodes in one DC). We use Cassandra 2.1.3 (I joined the list too late to
 realize we shouldn’t use 2.1.x yet) on Amazon machines (created from a
 community AMI).

 In addition to count variations of 5 to 10%, we observe variations in the
 results of the query “select * from table1 where time > '$fromDate' and
 time < '$toDate' allow filtering”. We iterated through the results
 multiple times using the official Java driver. We used that query for a
 huge data migration and were unpleasantly surprised that it is unreliable.
 In our case “nodetool repair” didn’t fix the issue.

 So I echo Frens' questions.

 Thanks,
 Mikhail




 On Wed, Mar 4, 2015 at 3:55 AM, Rumph, Frens Jan m...@frensjan.nl
 wrote:

 Hi,

 Is it to be expected that select count(*) from ... and select distinct
 partition-key-columns from ... yield inconsistent results between
 executions even though the table at hand isn't written to?

 I have a table in a keyspace with replication_factor = 1 which is
 something like:

  CREATE TABLE tbl (
 id frozen<id_type>,
 bucket bigint,
 offset int,
 value double,
 PRIMARY KEY ((id, bucket), offset)
 )

 The frozen udt is:

  CREATE TYPE id_type (
 tags map<text, text>
 );

 When I do select count(*) from tbl several times, the actual count
 varies by 5 to 10%. Also when performing select distinct id, bucket from
 tbl, the results aren't consistent over several query executions. The table
 is not being written to at the time I performed the queries.

 Is this to be expected? Or is this a bug? Is there an alternative
 method / workaround?

 I'm using cqlsh 5.0.1 with Cassandra 2.1.2 on 64-bit Fedora 21 with
 Oracle Java 1.8.0_31.

 Thanks in advance,
 Frens Jan








Re: Unable to overwrite some rows

2015-03-12 Thread Guðmundur Örn Jóhannsson
That's it. The clock on one of the nodes was way off. Thanks!!

--
regards,
Gudmundur Johannsson


On Wed, Mar 11, 2015 at 3:42 PM, Roland Etzenhammer 
r.etzenham...@t-online.de wrote:

 Hi,

 I think that your clocks are not in sync. Do you have ntp on all your
 nodes up and running with low offset? If not, set up ntp as the first
 probable solution. Cassandra relies on accurate clocks on all cluster nodes for its
 (internal) timestamps.
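
 A quick way to check this on each node (assuming ntpd with its standard
 ntpq tool; the offset column is reported in milliseconds):

  ntpq -p   # every node should show a small offset against its sync peer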

 Do you see any error while writing? Or just while reading?

 Cheers,
 Roland




Re: Adding a Cassandra node using OpsCenter

2015-03-12 Thread Ajay
Is there a separate forum for OpsCenter?

Thanks
Ajay
On 11-Mar-2015 4:16 pm, Ajay ajay.ga...@gmail.com wrote:

 Hi,

 While adding a Cassandra node using OpsCenter (which is recommended), the
 versions of Cassandra (DataStax community edition) shown are only 2.0.9 and not
 later versions in 2.0.x. Is there a reason behind it? Is 2.0.9 recommended
 over 2.0.11?

 Thanks
 Ajay



Re: CQL 3.x Update ...USING TIMESTAMP...

2015-03-12 Thread Eric Stevens
 It's possible, but you'll end up with problems when attempting to
overwrite or delete entries

I'm wondering if you can elucidate on that a little bit, do you just mean
that it's easy to forget to always set your timestamp correctly, and if you
goof it up, it makes it difficult to recover from (i.e. you issue a delete
with system timestamp instead of document version, and that's way larger
than your document version would ever be, so you can never write that
document again)?  Or is there some bug in write timestamps that can cause
the wrong entry to win the write contention?
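
To make the first failure mode concrete, here is a minimal cqlsh sketch; the
table docs(id int PRIMARY KEY, body text) and all timestamp values are
hypothetical:

  UPDATE docs USING TIMESTAMP 5 SET body = 'v5' WHERE id = 1;      -- version used as write timestamp
  DELETE FROM docs USING TIMESTAMP 1426176000000000 WHERE id = 1;  -- accidental delete at system micros
  UPDATE docs USING TIMESTAMP 6 SET body = 'v6' WHERE id = 1;      -- silently loses: 6 < tombstone timestamp
  SELECT body FROM docs WHERE id = 1;                              -- returns no row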

We're looking at doing something similar to keep a live max value column in
a given table; our setup is as follows:

CREATE TABLE a (
  id whatever,
  time timestamp,
  max_b_foo int,
  PRIMARY KEY (id)
);
CREATE TABLE b (
  b_id whatever,
  a_id whatever,
  a_timestamp timestamp,
  foo int,
  PRIMARY KEY (a_id, b_id)
);

The idea being that there's a one-to-many relationship between *a* and *b*.
We want *a* to know what the maximum value is in *b* for field *foo* so we
can avoid reading *all* *b* when we want to resolve *a*. You can see that
we can't just use *b*'s clustering key to resolve that with LIMIT 1; also
this is for DSE Solr, which wouldn't be able to query a by max b.foo
anyway.  So when we write to *b*, we also write to *a* with something like

UPDATE a USING TIMESTAMP ${b.a_timestamp.toMicros + b.foo} SET max_b_foo =
${b.foo} WHERE id = ${b.a_id}

Assuming that we don't run afoul of related antipatterns such as repeatedly
overwriting the same value indefinitely, this strikes me as sound if
unorthodox practice, as long as conflict resolution in Cassandra isn't
broken in some subtle way.  We also designed this to be safe from getting
write timestamps greatly out of sync with clock time so that
non-timestamped operations (especially delete) if done accidentally will
still have a reasonable chance of having the expected results.

So while it may not be the intended use case for write timestamps, and
there are definitely gotchas if you are not careful or misunderstand the
consequences, as far as I can see the logic behind it is sound but does
rely on correct conflict resolution in Cassandra.  I'm curious if I'm
missing or misunderstanding something important.

On Wed, Mar 11, 2015 at 4:11 PM, Tyler Hobbs ty...@datastax.com wrote:

 Don't use the version as your timestamp.  It's possible, but you'll end up
 with problems when attempting to overwrite or delete entries.

 Instead, make the version part of the primary key:

 CREATE TABLE document_store (document_id bigint, version int, document
 text, PRIMARY KEY (document_id, version)) WITH CLUSTERING ORDER BY (version
 desc)

 That way you don't have to worry about overwriting higher versions with a
 lower one, and to read the latest version, you only have to do:

 SELECT * FROM document_store WHERE document_id = ? LIMIT 1;

 Another option is to use lightweight transactions (i.e. UPDATE ... SET
 document = ?, version = ? WHERE document_id = ? IF version < ?), but
 that's going to make writes much more expensive.

 On Wed, Mar 11, 2015 at 12:45 AM, Sachin Nikam skni...@gmail.com wrote:

 I am planning to use the Update...USING TIMESTAMP... statement to make
 sure that I do not overwrite fresh data with stale data while having to
 avoid doing at least LOCAL_QUORUM writes.

 Here is my table structure.

 Table=DocumentStore
 DocumentID (primaryKey, bigint)
 Document(text)
 Version(int)

 If the service receives 2 write requests with Version=1 and Version=2,
 regardless of the order of arrival, the business requirement is that we end
 up with Version=2 in the database.

 Can I use the following CQL Statement?

 Update DocumentStore USING TIMESTAMP versionValue
 SET  Document=documentValue,
 Version=versionValue
 where DocumentID=documentIDValue;

 Has anybody used something like this? If so was the behavior as expected?

 Regards
 Sachin




 --
 Tyler Hobbs
 DataStax http://datastax.com/



Re: Steps to do after schema changes

2015-03-12 Thread Mark Reddy
It's always good to run nodetool describecluster after a schema change;
this will show you all the nodes in your cluster and what schema version
they have. If they have different versions, you have a schema disagreement
and should follow this guide to resolution:
http://www.datastax.com/documentation/cassandra/2.0/cassandra/dml/dml_handle_schema_disagree_t.html
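
For reference, that check is a single command per node (or point nodetool at
each host with -h; the versions appear under the Schema versions section of
the output):

  nodetool describecluster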

Regards,
Mark

On 12 March 2015 at 05:47, Phil Yang ud1...@gmail.com wrote:

 Usually, you have nothing to do. Changes will be synced to every node
 automatically.

 2015-03-12 13:21 GMT+08:00 Ajay ajay.ga...@gmail.com:

 Hi,

 Are there any steps to do (like running nodetool or restarting the node) or any
 precautions after schema changes are done on a column family, say adding a
 new column or modifying any table properties?

 Thanks
 Ajay




 --
 Thanks,
 Phil Yang




Re: Stable cassandra build for production usage

2015-03-12 Thread Ajay
Hi,

We did our research using version 2.0.11. While preparing for the
production deployment, we found the following issues:

1) 2.0.12 has a nodetool cleanup issue -
https://issues.apache.org/jira/browse/CASSANDRA-8718
2) 2.0.11 has a nodetool issue -
https://issues.apache.org/jira/browse/CASSANDRA-8548
3) OpsCenter 5.1.0 supports only 2.0.9 and not later 2.0.x -
https://issues.apache.org/jira/browse/CASSANDRA-8072
4) 2.0.9 has a schema refresh issue -
https://issues.apache.org/jira/browse/CASSANDRA-7734

Please suggest the best option for production deployment in
EC2, given that we are deploying a Cassandra cluster for the 1st time (so it is
likely that we will add more data centers/nodes and make schema changes in the
initial few months)

Thanks
Ajay

On Thu, Jan 1, 2015 at 9:49 PM, Neha Trivedi nehajtriv...@gmail.com wrote:

 Use 2.0.11 for production

 On Wed, Dec 31, 2014 at 11:50 PM, Robert Coli rc...@eventbrite.com
 wrote:

 On Wed, Dec 31, 2014 at 8:38 AM, Ajay ajay.ga...@gmail.com wrote:

 For my research and learning I am using Cassandra 2.1.2. But I see a
 couple of mail threads going on about issues in 2.1.2. So what is the stable or
 popular build for production in the Cassandra 2.x series?

 https://engineering.eventbrite.com/what-version-of-cassandra-should-i-run/

 =Rob





Re: Adding a Cassandra node using OpsCenter

2015-03-12 Thread Nick Bailey
There isn't an OpsCenter specific mailing list no.

To answer your question, the reason OpsCenter provisioning doesn't support
2.0.10 and 2.0.11 is due to
https://issues.apache.org/jira/browse/CASSANDRA-8072.

That bug unfortunately prevents OpsCenter provisioning from working
correctly, but isn't serious outside of provisioning. OpsCenter may be able
to come up with a workaround but at the moment those versions are
unsupported. Sorry for the inconvenience.

-Nick

On Thu, Mar 12, 2015 at 9:18 AM, Ajay ajay.ga...@gmail.com wrote:

 Is there a separate forum for OpsCenter?

 Thanks
 Ajay
 On 11-Mar-2015 4:16 pm, Ajay ajay.ga...@gmail.com wrote:

 Hi,

 While adding a Cassandra node using OpsCenter (which is recommended), the
 versions of Cassandra (DataStax community edition) shown are only 2.0.9 and not
 later versions in 2.0.x. Is there a reason behind it? Is 2.0.9 recommended
 over 2.0.11?

 Thanks
 Ajay




Re: Stable cassandra build for production usage

2015-03-12 Thread Robert Coli
On Thu, Mar 12, 2015 at 10:50 AM, Ajay ajay.ga...@gmail.com wrote:

 Please suggest the best option for production deployment
 in EC2, given that we are deploying a Cassandra cluster for the 1st time (so it is
 likely that we will add more data centers/nodes and make schema changes in the
 initial few months)


Voting for 2.0.13 is in progress. I'd wait for that. But I don't need
OpsCenter.

=Rob


Re: CQL 3.x Update ...USING TIMESTAMP...

2015-03-12 Thread Jonathan Haddad
In most datacenters you're going to see significant variance in your server
times.  Likely > 20ms between servers in the same rack.  Even Google, using
atomic clocks, sees 1-7ms variance.  [1]

I would +1 Tyler's advice here, as using the clocks is only valid if clocks
are perfectly sync'ed, which they are not, and likely never will be in our
lifetime.

[1] http://queue.acm.org/detail.cfm?id=2745385


On Thu, Mar 12, 2015 at 7:04 AM Eric Stevens migh...@gmail.com wrote:

  It's possible, but you'll end up with problems when attempting to
 overwrite or delete entries

 I'm wondering if you can elucidate on that a little bit, do you just mean
 that it's easy to forget to always set your timestamp correctly, and if you
 goof it up, it makes it difficult to recover from (i.e. you issue a delete
 with system timestamp instead of document version, and that's way larger
 than your document version would ever be, so you can never write that
 document again)?  Or is there some bug in write timestamps that can cause
 the wrong entry to win the write contention?

 We're looking at doing something similar to keep a live max value column
 in a given table; our setup is as follows:

 CREATE TABLE a (
   id whatever,
   time timestamp,
   max_b_foo int,
   PRIMARY KEY (id)
 );
 CREATE TABLE b (
   b_id whatever,
   a_id whatever,
   a_timestamp timestamp,
   foo int,
   PRIMARY KEY (a_id, b_id)
 );

 The idea being that there's a one-to-many relationship between *a* and *b*.
 We want *a* to know what the maximum value is in *b* for field *foo* so
 we can avoid reading *all* *b* when we want to resolve *a*. You can see
 that we can't just use *b*'s clustering key to resolve that with LIMIT 1;
 also this is for DSE Solr, which wouldn't be able to query a by max b.foo
 anyway.  So when we write to *b*, we also write to *a* with something
 like

 UPDATE a USING TIMESTAMP ${b.a_timestamp.toMicros + b.foo} SET max_b_foo =
 ${b.foo} WHERE id = ${b.a_id}

 Assuming that we don't run afoul of related antipatterns such as
 repeatedly overwriting the same value indefinitely, this strikes me as
 sound if unorthodox practice, as long as conflict resolution in Cassandra
 isn't broken in some subtle way.  We also designed this to be safe from
 getting write timestamps greatly out of sync with clock time so that
 non-timestamped operations (especially delete) if done accidentally will
 still have a reasonable chance of having the expected results.

 So while it may not be the intended use case for write timestamps, and
 there are definitely gotchas if you are not careful or misunderstand the
 consequences, as far as I can see the logic behind it is sound but does
 rely on correct conflict resolution in Cassandra.  I'm curious if I'm
 missing or misunderstanding something important.

 On Wed, Mar 11, 2015 at 4:11 PM, Tyler Hobbs ty...@datastax.com wrote:

 Don't use the version as your timestamp.  It's possible, but you'll end
 up with problems when attempting to overwrite or delete entries.

 Instead, make the version part of the primary key:

 CREATE TABLE document_store (document_id bigint, version int, document
 text, PRIMARY KEY (document_id, version)) WITH CLUSTERING ORDER BY (version
 desc)

 That way you don't have to worry about overwriting higher versions with a
 lower one, and to read the latest version, you only have to do:

 SELECT * FROM document_store WHERE document_id = ? LIMIT 1;

 Another option is to use lightweight transactions (i.e. UPDATE ... SET
 document = ?, version = ? WHERE document_id = ? IF version < ?), but
 that's going to make writes much more expensive.

 On Wed, Mar 11, 2015 at 12:45 AM, Sachin Nikam skni...@gmail.com wrote:

 I am planning to use the Update...USING TIMESTAMP... statement to make
 sure that I do not overwrite fresh data with stale data while having to
 avoid doing at least LOCAL_QUORUM writes.

 Here is my table structure.

 Table=DocumentStore
 DocumentID (primaryKey, bigint)
 Document(text)
 Version(int)

 If the service receives 2 write requests with Version=1 and Version=2,
 regardless of the order of arrival, the business requirement is that we end
 up with Version=2 in the database.

 Can I use the following CQL Statement?

 Update DocumentStore USING TIMESTAMP versionValue
 SET  Document=documentValue,
 Version=versionValue
 where DocumentID=documentIDValue;

 Has anybody used something like this? If so was the behavior as expected?

 Regards
 Sachin




 --
 Tyler Hobbs
 DataStax http://datastax.com/





Re: Adding a Cassandra node using OpsCenter

2015-03-12 Thread Nick Bailey
Correct, OpsCenter can monitor 2.0.10 and later clusters/nodes. It just
can't provision them.

On Thu, Mar 12, 2015 at 1:16 PM, Ajay ajay.ga...@gmail.com wrote:

 Thanks Nick.

 Does it mean that only adding a new node with 2.0.10 or later is a
 problem? Can a node added manually be monitored from OpsCenter?

 Thanks
 Ajay
 On 12-Mar-2015 10:19 pm, Nick Bailey n...@datastax.com wrote:

 There isn't an OpsCenter specific mailing list no.

 To answer your question, the reason OpsCenter provisioning doesn't
 support 2.0.10 and 2.0.11 is due to
 https://issues.apache.org/jira/browse/CASSANDRA-8072.

 That bug unfortunately prevents OpsCenter provisioning from working
 correctly, but isn't serious outside of provisioning. OpsCenter may be able
 to come up with a workaround but at the moment those versions are
 unsupported. Sorry for the inconvenience.

 -Nick

 On Thu, Mar 12, 2015 at 9:18 AM, Ajay ajay.ga...@gmail.com wrote:

 Is there a separate forum for OpsCenter?

 Thanks
 Ajay
 On 11-Mar-2015 4:16 pm, Ajay ajay.ga...@gmail.com wrote:

 Hi,

 While adding a Cassandra node using OpsCenter (which is recommended),
 the versions of Cassandra (DataStax community edition) shown are only 2.0.9 and
 not later versions in 2.0.x. Is there a reason behind it? Is 2.0.9
 recommended over 2.0.11?

 Thanks
 Ajay





Node data sync/recovery process

2015-03-12 Thread Akash Pandey
Hi

I had a doubt regarding the C* node recovery process.

Assumption: a two data center C* cluster with RF=3 and CL=LOCAL_QUORUM.

Suppose a node went down for a period within the hinted handoff window. Once
the node comes back up, automatic data syncing would start for that node.
This recovery may take some time.
So my doubt is: during this period, if a read for a key stored on the
recovering node comes in, will the coordinator node ask for data from the
recovering node, which might possibly have stale data?

If yes, then how does a C* client handle the situation when the majority of
nodes for a key in one data center are recovering and it can end up getting
stale data?

Help much appreciated.

Thanks
Akash


Re: Steps to do after schema changes

2015-03-12 Thread Ajay
Thanks Mark.

-
Ajay
On 12-Mar-2015 11:08 pm, Mark Reddy mark.l.re...@gmail.com wrote:

 It's always good to run nodetool describecluster after a schema change;
 this will show you all the nodes in your cluster and what schema version
 they have. If they have different versions, you have a schema disagreement
 and should follow this guide to resolution:
 http://www.datastax.com/documentation/cassandra/2.0/cassandra/dml/dml_handle_schema_disagree_t.html

 Regards,
 Mark

 On 12 March 2015 at 05:47, Phil Yang ud1...@gmail.com wrote:

 Usually, you have nothing to do. Changes will be synced to every node
 automatically.

 2015-03-12 13:21 GMT+08:00 Ajay ajay.ga...@gmail.com:

 Hi,

 Are there any steps to do (like running nodetool or restarting the node) or any
 precautions after schema changes are done on a column family, say adding a
 new column or modifying any table properties?

 Thanks
 Ajay




 --
 Thanks,
 Phil Yang





Re: Adding a Cassandra node using OpsCenter

2015-03-12 Thread Ajay
Thanks Nick.

Does it mean that only adding a new node with 2.0.10 or later is a
problem? Can a node added manually be monitored from OpsCenter?

Thanks
Ajay
On 12-Mar-2015 10:19 pm, Nick Bailey n...@datastax.com wrote:

 There isn't an OpsCenter specific mailing list no.

 To answer your question, the reason OpsCenter provisioning doesn't support
 2.0.10 and 2.0.11 is due to
 https://issues.apache.org/jira/browse/CASSANDRA-8072.

 That bug unfortunately prevents OpsCenter provisioning from working
 correctly, but isn't serious outside of provisioning. OpsCenter may be able
 to come up with a workaround but at the moment those versions are
 unsupported. Sorry for the inconvenience.

 -Nick

 On Thu, Mar 12, 2015 at 9:18 AM, Ajay ajay.ga...@gmail.com wrote:

 Is there a separate forum for OpsCenter?

 Thanks
 Ajay
 On 11-Mar-2015 4:16 pm, Ajay ajay.ga...@gmail.com wrote:

 Hi,

 While adding a Cassandra node using OpsCenter (which is recommended),
 the versions of Cassandra (DataStax community edition) shown are only 2.0.9 and
 not later versions in 2.0.x. Is there a reason behind it? Is 2.0.9
 recommended over 2.0.11?

 Thanks
 Ajay





Re: DataStax Enterprise Amazon AMI Launch Error

2015-03-12 Thread Ali Akhtar
Seems like it's having trouble launching the other EC2 instances that you're
requesting. You would need to provide it with your AWS credentials for an
account that has the permissions to create EC2 instances. Have you done
that?

If you just want to install cassandra on AWS, you might find this bash
script useful: https://gist.github.com/aliakhtar/3649e412787034156cbb

On Thu, Mar 12, 2015 at 5:14 PM, Vanessa Gligor vanessagli...@gmail.com
wrote:

 I'm trying to launch a new instance of the DataStax AMI on an Amazon EC2
 instance. I tried this in 2 different regions (us-east and eu-west), using
 these AMIs: ami-ada2b6c4, ami-814ec2e8 (us-east) and ami-7f33cd08,
 ami-b2212dc6 (eu-west).

 I followed this documentation:
 http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/install/installAMI.html

 So this is what I've done so far:

 1. I've created a new security group (with those specific ports - I cannot
 upload the print screen because I have just created this account)

 2. I've created a new key pair

 3. I've launched the DataStax AMI with these configuration details:
 --clustername cluster --totalnodes 4 --version enterprise --username
 my_name --password my_password --searchnodes 2 (I have verified my
 credentials - I can login here http://debian.datastax.com/enterprise/ )

 4. After selecting the previously created security group & key pair I
 launched the instance

 5. I've connected to my DataStax Enterprise EC2 instance and this is the
 displayed log:

 Cluster started with these options: --clustername cluster --totalnodes 4
 --version enterprise --username my_name --password  --searchnodes 2

 03/12/15-08:59:23 Reflector: Received 1 of 2 responses from:
 [u'172.31.34.171']... Exception seen in ds1_launcher.py. Please check
 ~/datastax_ami/ami.log for more info. Please visit 


 and the ami.log shows these messages:


 [INFO] 03/12/15-08:59:23 Reflector: Received 1 of 2 responses from: 
 [u'172.31.34.171']
 [ERROR] EC2 is experiencing some issues and has not allocated all of the 
 resources in under 10 minutes.
 Aborting the clustering of this reservation. Please try again.
 [ERROR] Exception seen in ds1_launcher.py:
 Traceback (most recent call last):
 File /home/ubuntu/datastax_ami/ds1_launcher.py, line 22, in 
 initial_configurations
 ds2_configure.run()
  File /home/ubuntu/datastax_ami/ds2_configure.py, line 1135, in run
 File /home/ubuntu/datastax_ami/ds2_configure.py, line 57, in exit_path
 AttributeError: EC2 is experiencing some issues and has not allocated all of 
 the resources in under 10 minutes.
 Aborting the clustering of this reservation. Please try again.

 Any suggestion on how to fix this problem?

 Thank you!

 Have a nice day,

 Vanessa.




Re: how to clear data from disk

2015-03-12 Thread Ben Bromhead
To clarify why this behaviour occurs: by default, Cassandra will snapshot
a table when you perform any destructive action (TRUNCATE, DROP, etc.)

see
http://www.datastax.com/documentation/cql/3.0/cql/cql_reference/truncate_r.html

To free disk space after such an operation you will always need to clear
the snapshots (using either of the above suggested methods). Unfortunately
this can be a bit painful if you are rotating your tables, say by month, and
want to remove the oldest one from disk, as your client will need to speak
JMX as well.

You can disable this behaviour through the auto_snapshot setting in
cassandra.yaml, though I would strongly recommend leaving this feature
enabled in any sane production environment and cleaning up snapshots as an
independent task!
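
A minimal operational sketch (the keyspace name ks_metrics is hypothetical;
run it on every node, and note the cassandra.yaml path varies by install):

  nodetool clearsnapshot ks_metrics   # drop all snapshots for that keyspace on this node
  # in cassandra.yaml, the setting discussed above; leave it enabled in production:
  # auto_snapshot: true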

On 10 March 2015 at 20:43, Patrick McFadin pmcfa...@gmail.com wrote:

 Or just manually delete the files. The directories are broken down by
 keyspace and table.

 Patrick

 On Mon, Mar 9, 2015 at 7:50 PM, 曹志富 cao.zh...@gmail.com wrote:

 nodetool clearsnapshot

 --
 Ranger Tsao

 2015-03-10 10:47 GMT+08:00 鄢来琼 laiqiong@gtafe.com:

  Hi ALL,



 After dropping a table, I found the data is not removed from disk; I should
 have reduced gc_grace_seconds before the drop operation.

 I would have to wait for 10 days, but there is not enough disk space.

 Could you tell me if there is a method to clear the data from disk quickly?

 Thank you very much!



 Peter






-- 

Ben Bromhead

Instaclustr | www.instaclustr.com | @instaclustr
http://twitter.com/instaclustr | (650) 284 9692


Re: CQL 3.x Update ...USING TIMESTAMP...

2015-03-12 Thread Eric Stevens
Ok, but if you're using a system of time that isn't server-clock oriented
(Sachin's document revision ID, and my fixed and necessarily consistent
base timestamp [B's always know their parent A's exact recorded
timestamp]), isn't the principle of using timestamps to force a particular
update out of several to win still sound?

 as using the clocks is only valid if clocks are perfectly sync'ed, which
they are not

Clock skew is a problem which doesn't seem to be a factor in either use
case, given that both have a consistent external source of truth for the
timestamp.
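
Stated as a minimal cqlsh sketch (the table docs(id int PRIMARY KEY, body
text) is hypothetical), the principle in question is simply that the larger
explicit timestamp wins regardless of arrival order:

  UPDATE docs USING TIMESTAMP 2 SET body = 'v2' WHERE id = 1;
  UPDATE docs USING TIMESTAMP 1 SET body = 'v1' WHERE id = 1;  -- applied later, but loses reconciliation
  SELECT body FROM docs WHERE id = 1;                          -- expected: 'v2'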

On Thu, Mar 12, 2015 at 12:58 PM, Jonathan Haddad j...@jonhaddad.com wrote:

 In most datacenters you're going to see significant variance in your
 server times.  Likely > 20ms between servers in the same rack.  Even
 Google, using atomic clocks, sees 1-7ms variance.  [1]

 I would +1 Tyler's advice here, as using the clocks is only valid if
 clocks are perfectly sync'ed, which they are not, and likely never will be
 in our lifetime.

 [1] http://queue.acm.org/detail.cfm?id=2745385


 On Thu, Mar 12, 2015 at 7:04 AM Eric Stevens migh...@gmail.com wrote:

  It's possible, but you'll end up with problems when attempting to
 overwrite or delete entries

 I'm wondering if you can elucidate on that a little bit, do you just mean
 that it's easy to forget to always set your timestamp correctly, and if you
 goof it up, it makes it difficult to recover from (i.e. you issue a delete
 with system timestamp instead of document version, and that's way larger
 than your document version would ever be, so you can never write that
 document again)?  Or is there some bug in write timestamps that can cause
 the wrong entry to win the write contention?

 We're looking at doing something similar to keep a live max value column
 in a given table; our setup is as follows:

 CREATE TABLE a (
   id whatever,
   time timestamp,
   max_b_foo int,
   PRIMARY KEY (id)
 );
 CREATE TABLE b (
   b_id whatever,
   a_id whatever,
   a_timestamp timestamp,
   foo int,
   PRIMARY KEY (a_id, b_id)
 );

 The idea being that there's a one-to-many relationship between *a* and
 *b*.  We want *a* to know what the maximum value is in *b* for field
 *foo* so we can avoid reading *all* *b* when we want to resolve *a*. You
 can see that we can't just use *b*'s clustering key to resolve that with
 LIMIT 1; also this is for DSE Solr, which wouldn't be able to query a by
 max b.foo anyway.  So when we write to *b*, we also write to *a* with
 something like

 UPDATE a USING TIMESTAMP ${b.a_timestamp.toMicros + b.foo} SET max_b_foo
 = ${b.foo} WHERE id = ${b.a_id}

 Assuming that we don't run afoul of related antipatterns such as
 repeatedly overwriting the same value indefinitely, this strikes me as
 sound if unorthodox practice, as long as conflict resolution in Cassandra
 isn't broken in some subtle way.  We also designed this to be safe from
 getting write timestamps greatly out of sync with clock time so that
 non-timestamped operations (especially delete) if done accidentally will
 still have a reasonable chance of having the expected results.

 So while it may not be the intended use case for write timestamps, and
 there are definitely gotchas if you are not careful or misunderstand the
 consequences, as far as I can see the logic behind it is sound but does
 rely on correct conflict resolution in Cassandra.  I'm curious if I'm
 missing or misunderstanding something important.

 On Wed, Mar 11, 2015 at 4:11 PM, Tyler Hobbs ty...@datastax.com wrote:

 Don't use the version as your timestamp.  It's possible, but you'll end
 up with problems when attempting to overwrite or delete entries.

 Instead, make the version part of the primary key:

 CREATE TABLE document_store (document_id bigint, version int, document
 text, PRIMARY KEY (document_id, version)) WITH CLUSTERING ORDER BY (version
 desc)

 That way you don't have to worry about overwriting higher versions with
 a lower one, and to read the latest version, you only have to do:

 SELECT * FROM document_store WHERE document_id = ? LIMIT 1;

 Another option is to use lightweight transactions (i.e. UPDATE ... SET
 document = ?, version = ? WHERE document_id = ? IF version < ?), but
 that's going to make writes much more expensive.

 On Wed, Mar 11, 2015 at 12:45 AM, Sachin Nikam skni...@gmail.com
 wrote:

 I am planning to use the Update...USING TIMESTAMP... statement to make
 sure that I do not overwrite fresh data with stale data while having to
 avoid doing at least LOCAL_QUORUM writes.

 Here is my table structure.

 Table=DocumentStore
 DocumentID (primaryKey, bigint)
 Document(text)
 Version(int)

 If the service receives 2 write requests with Version=1 and Version=2,
 regardless of the order of arrival, the business requirement is that we end
 up with Version=2 in the database.

 Can I use the following CQL Statement?

 Update DocumentStore USING TIMESTAMP versionValue
 SET  Document=documentValue,
 Version=versionValue
 where DocumentID=documentIDValue;

DataStax Enterprise Amazon AMI Launch Error

2015-03-12 Thread Vanessa Gligor
I'm trying to launch a new instance of the DataStax AMI on an Amazon EC2
instance. I tried this in 2 different regions (us-east and eu-west), using
these AMIs: ami-ada2b6c4, ami-814ec2e8 (us-east) and ami-7f33cd08,
ami-b2212dc6 (eu-west).

I followed this documentation:
http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/install/installAMI.html

So this is what I've done so far:

1. I've created a new security group (with those specific ports - I cannot
upload the print screen because I have just created this account)

2. I've created a new key pair

3. I've launched the DataStax AMI with these configuration details:
--clustername cluster --totalnodes 4 --version enterprise --username
my_name --password my_password --searchnodes 2 (I have verified my
credentials - I can login here http://debian.datastax.com/enterprise/ )

4. After selecting the previously created security group & key pair I
launched the instance

5. I've connected to my DataStax Enterprise EC2 instance and this is the
displayed log:

Cluster started with these options: --clustername cluster --totalnodes 4
--version enterprise --username my_name --password  --searchnodes 2

03/12/15-08:59:23 Reflector: Received 1 of 2 responses from:
[u'172.31.34.171']... Exception seen in ds1_launcher.py. Please check
~/datastax_ami/ami.log for more info. Please visit 


and the ami.log shows these messages:


[INFO] 03/12/15-08:59:23 Reflector: Received 1 of 2 responses from:
[u'172.31.34.171']
[ERROR] EC2 is experiencing some issues and has not allocated all of
the resources in under 10 minutes.
Aborting the clustering of this reservation. Please try again.
[ERROR] Exception seen in ds1_launcher.py:
Traceback (most recent call last):
File /home/ubuntu/datastax_ami/ds1_launcher.py, line 22, in
initial_configurations
ds2_configure.run()
 File /home/ubuntu/datastax_ami/ds2_configure.py, line 1135, in run
File /home/ubuntu/datastax_ami/ds2_configure.py, line 57, in exit_path
AttributeError: EC2 is experiencing some issues and has not allocated
all of the resources in under 10 minutes.
Aborting the clustering of this reservation. Please try again.

Any suggestion on how to fix this problem?

Thank you!

Have a nice day,

Vanessa.


Re: Inconsistent count(*) and distinct results from Cassandra

2015-03-12 Thread Rumph, Frens Jan
Hi Jens, Mikhail, Daemeon,

Thanks for your replies. Sorry for my reply being late ... mails from the
user-list were moved to the wrong inbox on my side.

I'm in a development environment and thus using replication factor = 1 and
consistency = ONE with three nodes. So the 'results from different nodes
between queries' hypothesis seems unlikely to me. I would expect a timeout
if some node wouldn't be able to answer.

I tried tracing, but I couldn't really make anything of it.

For example I performed two select distinct ... from ... queries: Traces
for both of them contained more than one line like 'Submitting range
requests on ... ranges ...' and 'Submitted ... concurrent range requests
covering ... ranges'. These lines occur with varying numbers, e.g. :

Submitting range requests on 593 ranges with a concurrency of 75 (1.35 rows
per range expected)
Submitting range requests on 769 ranges with a concurrency of 75 (1.35 rows
per range expected)


Also when looking at the lines like 'Executing seq scan across ... sstables
for ...', I saw that in one case, which yielded far fewer partition keys,
only the tokens from -922337203685477  to -594461978511041000 were
included. In a case which yielded many more partition keys, the entire
token range did seem to be queried.

To reiterate my initial questions: is this behavior to be expected? Am I
doing something wrong? Is there a workaround?

Best regards,
Frens Jan

On 4 March 2015 at 22:59, daemeon reiydelle daeme...@gmail.com wrote:

 What is the replication? Could you be serving stale data from a node that
 was not properly replicated (hint timeout exceeded because a node was down)?



 On Wed, Mar 4, 2015 at 11:03 AM, Jens Rantil jens.ran...@tink.se wrote:

 Frens,

 What consistency are you querying with? Could be you are simply receiving
 results from different nodes each time.

 Jens

 –
 Sent from Mailbox https://www.dropbox.com/mailbox


 On Wed, Mar 4, 2015 at 7:08 PM, Mikhail Strebkov streb...@gmail.com
 wrote:

 We have observed the same issue in our production Cassandra cluster (5
 nodes in one DC). We use Cassandra 2.1.3 (I joined the list too late to
 realize we shouldn’t use 2.1.x yet) on Amazon machines (created from a
 community AMI).

 In addition to count variations of 5 to 10%, we observe variations in the
 results of the query “select * from table1 where time > '$fromDate' and
 time < '$toDate' allow filtering”. We iterated through the results
 multiple times using the official Java driver. We used that query for a
 huge data migration and were unpleasantly surprised that it is unreliable.
 In our case “nodetool repair” didn’t fix the issue.

 So I echo Frens' questions.

 Thanks,
 Mikhail




 On Wed, Mar 4, 2015 at 3:55 AM, Rumph, Frens Jan m...@frensjan.nl
 wrote:

 Hi,

 Is it to be expected that select count(*) from ... and select distinct
 partition-key-columns from ... yield inconsistent results between
 executions even though the table at hand isn't written to?

 I have a table in a keyspace with replication_factor = 1 which is
 something like:

  CREATE TABLE tbl (
 id frozen<id_type>,
 bucket bigint,
 offset int,
 value double,
 PRIMARY KEY ((id, bucket), offset)
 )

 The frozen udt is:

  CREATE TYPE id_type (
 tags map<text, text>
 );

 When I do select count(*) from tbl several times, the actual count
 varies by 5 to 10%. Also when performing select distinct id, bucket from
 tbl, the results aren't consistent over several query executions. The table
 is not being written to at the time I performed the queries.

 Is this to be expected? Or is this a bug? Is there an alternative method
 / workaround?

 I'm using cqlsh 5.0.1 with Cassandra 2.1.2 on 64-bit Fedora 21 with
 Oracle Java 1.8.0_31.

 Thanks in advance,
 Frens Jan