Re: How does clustering key works with TimeWindowCompactionStrategy (TWCS)

2017-04-07 Thread j.kesten
Hi Jerry,

The compaction strategy only tells Cassandra how to compact your sstables and, 
with TWCS, when to stop compacting further. But of course your data can, and 
most likely will, live in multiple sstables. 

The magic that happens is that the coordinator node for your request will merge 
the data for you on the fly. It is an easy job, as your data per sstable is 
already sorted.
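
As a rough illustration of why merging pre-sorted sstables is cheap, here is a 
minimal Python sketch. It is not Cassandra's actual read path: the row layout 
(date_day, write_timestamp, view_id), the sample values, and the function name 
merge_partition are all made up for the example. It assumes each sstable holds 
its slice of the partition sorted by date_day descending and that, for the same 
clustering key, the newest write wins.

```python
import heapq
from itertools import groupby

# Hypothetical rows: (date_day, write_timestamp, view_id).
# Each list stands for one sstable's slice of a single partition,
# already sorted by the clustering key date_day in descending order.
sstable_1 = [("2017-04-07", 100, 11), ("2017-04-05", 90, 7)]
sstable_2 = [("2017-04-07", 120, 12), ("2017-04-06", 110, 9)]
sstable_3 = [("2017-04-04", 80, 3)]


def merge_partition(*sstables):
    """Lazily merge sorted runs; for equal clustering keys the newest write wins."""
    # heapq.merge never re-sorts: it only compares the current head of each run.
    merged = heapq.merge(*sstables, key=lambda row: row[0], reverse=True)
    for _date_day, versions in groupby(merged, key=lambda row: row[0]):
        yield max(versions, key=lambda row: row[1])


rows = list(merge_partition(sstable_1, sstable_2, sstable_3))
latest = rows[0]  # first row in DESC order, i.e. the most recent date_day
```

Because the merge is lazy, a query that only needs the first row can stop after 
looking at the head of each run instead of reading every row.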

But be careful about the worst case: if a customer_id is inserted every hour 
and the data has to be kept for a year or so, a single read can end up touching 
many sstables, which decreases read performance.

Jan

Sent from my Windows 10 Phone

From: Jerry Lam
Sent: Friday, 7 April 2017 00:30
To: user@cassandra.apache.org
Subject: How does clustering key works with TimeWindowCompactionStrategy (TWCS)

Hi guys,

I'm a new and happy user of Cassandra. We are using Cassandra for time series 
data, so we chose TWCS because of its predictability and its ease of 
configuration.

My question concerns a table with the following schema:

CREATE TABLE IF NOT EXISTS customer_view (
    customer_id bigint,
    date_day timestamp,
    view_id bigint,
    PRIMARY KEY (customer_id, date_day)
) WITH CLUSTERING ORDER BY (date_day DESC);

What I understand is that the data will be ordered by date_day within the 
partition using the clustering key. However, the same customer_id can be 
inserted into this partition several times during the day, and TWCS says it 
will only compact the sstables within the window interval set in the 
configuration (1 hour in our case). 

How does Cassandra guarantee the clustering key order when the same customer_id 
appears in several sstables? Does it need to do a merge and then sort to find 
out the latest view_id for the customer_id? Or is there some magic happening 
behind the scenes?

Best Regards,

Jerry



Re: The changing clustering key

2017-04-06 Thread j.kesten
Hi,

Your primary goal is to fetch a user by dept_id and user_id and additionally 
keep versions of the user data?

CREATE TABLE users (
   dept_id text,
   user_id text,
   mod_date timestamp,
   user_name text,
   PRIMARY KEY ((dept_id, user_id), mod_date)
) WITH CLUSTERING ORDER BY (mod_date DESC);

There is a difference between the partition key and clustering keys. My 
suggestion puts all versions of a particular (dept_id,user_id) on one partition 
(say, one node), and within that partition the versions are stored in 
descending order by mod_date. 

For a normal lookup you do not need to know mod_date; a simple SELECT * FROM 
users WHERE dept_id='foo' AND user_id='bar' LIMIT 1 will do.

http://datascale.io/cassandra-partitioning-and-clustering-keys-explained/



Sent from my Windows 10 Phone

From: Monmohan Singh
Sent: Thursday, 6 April 2017 13:54
To: user@cassandra.apache.org
Subject: The changing clustering key

Dear Cassandra experts,
I have a data modeling question for cases where data needs to be sorted by keys 
that can be modified.
Say we have a user table:
{
   dept_id text,
   user_id text,
   user_name text,
   mod_date timestamp,
   PRIMARY KEY (dept_id,user_id)
}
Now I can query Cassandra to get all users by dept_id.
What if I wanted to get all users in a dept, sorted by mod_date?
One way would be:
{
   dept_id text,
   user_id text,
   mod_date timestamp,
   user_name text,
   PRIMARY KEY (dept_id,user_id, mod_date)
}
But mod_date changes every time the user name is updated, so it can't be part 
of the clustering key.

Attempt 1: Don't update the row but instead create a new record for every 
update. Say the record for user foo looks like
{'dept_id1','user_id1',TimeStamp1,'foo'} and then the name was changed to 
'bar' and then to 'baz'. In that case we add another row to the table each 
time, so the table data would look like

{'dept_id1','user_id1',TimeStamp3,'baz'}
{'dept_id1','user_id1',TimeStamp2,'bar'}
{'dept_id1','user_id1',TimeStamp1,'foo'}

Now we can get all users in a dept, sorted by mod_date but it presents a 
different problem. The data returned is duplicated. 

Attempt 2: Add another column to identify the head record, much like a linked 
list:
{
   dept_id text,
   user_id text,
   mod_date timestamp,
   user_name text,
   next_record text,
   PRIMARY KEY (dept_id,user_id, mod_date)
}
Every time an update happens, a new row is added and the previous row stores 
the PK of the record that superseded it; only the latest record is marked HEAD.

{'dept_id1','user_id1',TimeStamp3,'baz','HEAD'}
{'dept_id1','user_id1',TimeStamp2,'bar','dept_id1#user_id1#TimeStamp3'}
{'dept_id1','user_id1',TimeStamp1,'foo','dept_id1#user_id1#TimeStamp2'}
I would also add a secondary index on the 'next_record' column.

Now I can support "get all users in a dept, sorted by mod_date" with
SELECT * from USERS where dept_id=':dept' AND next_record='HEAD' order by 
mod_date.

But this looks like a fairly involved solution, and perhaps I am missing a 
simpler one.

The other option is delete and insert but for high frequency changes I think 
Cassandra has issues with tombstones.

Thanks for helping on this.
Regards
Monmohan




Re: question on maximum disk seeks

2017-03-20 Thread j.kesten
Hi,

You're right: one seek with a hit in the partition key cache and two without.

That's the theory, but there are two things to mention:

First, you need two seeks per sstable, not per entire read. So if your data is 
spread over multiple sstables on disk, you obviously need more than two seeks. 
Think of frequently updated partition keys: in combination with memory pressure 
you can easily end up with a great many sstables (OK, they will be compacted 
some time in the future).

Second, there could be fragmentation on disk which leads to seeks during 
sequential reads. 
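
To make the per-sstable rule above concrete, here is a small back-of-the-envelope 
sketch in Python. The function name estimated_seeks and its parameters are 
invented for this example, and the model is deliberately crude: it counts only 
sstables that actually hold the partition, charges one seek on a key cache hit 
and two on a miss, and ignores the fragmentation effect mentioned above.

```python
def estimated_seeks(sstables_with_key, key_cache_hits=0):
    """Rough seek count for one partition read.

    sstables_with_key: number of sstables that contain the partition.
    key_cache_hits:    how many of those lookups hit the partition key cache.
    """
    if not 0 <= key_cache_hits <= sstables_with_key:
        raise ValueError("key_cache_hits must be between 0 and sstables_with_key")
    misses = sstables_with_key - key_cache_hits
    # Cache hit: 1 seek (data only). Cache miss: 2 seeks (partition index + data).
    return key_cache_hits * 1 + misses * 2


one_sstable_cold = estimated_seeks(1)                     # the textbook case
five_sstables_mixed = estimated_seeks(5, key_cache_hits=3)
```

With one sstable and a cold cache this gives the two seeks from the theory; with 
the partition spread over five sstables and three cache hits it already gives 
seven.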

Jan

Sent from my Windows 10 Phone

From: preetika tyagi
Sent: Monday, 20 March 2017 21:18
To: user@cassandra.apache.org
Subject: question on maximum disk seeks



I'm trying to understand the maximum number of disk seeks required in a read 
operation in Cassandra. I looked at several online articles including this one: 
https://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dmlAboutReads.html
As per my understanding, two disk seeks are required in the worst case. One is 
for reading the partition index and another is to read the actual data from the 
compressed partition. The index of the data in compressed partitions is 
obtained from the compression offset tables (which are stored in memory). Am I 
on the right track here? Will there ever be a case when more than 1 disk seek 
is required to read the data?
Thanks,
Preetika




Re: How can I scale my read rate?

2017-03-18 Thread j.kesten
+1 for executeAsync. I had to argue for a long time that it is not bad the way 
it was with a good old RDBMS. 



Sent from my Windows 10 Phone

From: Arvydas Jonusonis
Sent: Saturday, 18 March 2017 19:08
To: user@cassandra.apache.org
Subject: Re: How can I scale my read rate?

..then you're not taking advantage of request pipelining. Use executeAsync - 
this will increase your throughput for sure.
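
The effect of pipelining can be illustrated without a cluster. The Python sketch 
below is not the Java driver API: FakeSession and its execute_async method are 
hypothetical stand-ins that simulate a network round trip with a short sleep. 
It only shows that awaiting each request before sending the next keeps one 
request in flight, while issuing them all up front keeps many in flight.

```python
import asyncio


class FakeSession:
    """Hypothetical stand-in for a driver session; not a real driver API."""

    def __init__(self):
        self.in_flight = 0
        self.peak_in_flight = 0

    async def execute_async(self, query):
        self.in_flight += 1
        self.peak_in_flight = max(self.peak_in_flight, self.in_flight)
        await asyncio.sleep(0.01)  # simulated network round trip
        self.in_flight -= 1
        return f"result of {query}"


async def one_at_a_time(session, queries):
    # Equivalent to a blocking execute(): wait for each answer before sending more.
    return [await session.execute_async(q) for q in queries]


async def pipelined(session, queries):
    # Send everything first, then collect the answers in order.
    return await asyncio.gather(*(session.execute_async(q) for q in queries))


queries = [f"q{i}" for i in range(10)]
blocking_session, async_session = FakeSession(), FakeSession()
blocking_results = asyncio.run(one_at_a_time(blocking_session, queries))
pipelined_results = asyncio.run(pipelined(async_session, queries))
```

Both variants return the same results; the difference is that the pipelined one 
overlaps the round trips instead of paying for them one after another.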

http://www.datastax.com/dev/blog/java-driver-async-queries


On Sat, Mar 18, 2017 at 08:00 S G  wrote:
I have enabled JMX but am not sure what metrics to look for; there are way too 
many of them.
I am using session.execute(...)


On Fri, Mar 17, 2017 at 2:07 PM, Arvydas Jonusonis wrote:
It would be interesting to see some of the driver metrics (in your stress test 
tool) - if you enable JMX, they should be exposed by default.

Also, are you using session.execute(..) or session.executeAsync(..) ?





Re: Issue with Cassandra consistency in results

2017-03-16 Thread j.kesten
Hi,

After a quick scan of the thread, two things came to mind:

First, did the restore copy the sstables back to the right machines? Node A's 
data to node A and so on? 

Second, did you run full repairs on every node? Not just incremental ones, 
which are now the default?

A look into debug.log is also an option.

If all of that is done already, never mind.

Sent from my Windows 10 Phone

From: Ryan Svihla
Sent: Thursday, 16 March 2017 18:57
To: user
Subject: Re: Issue with Cassandra consistency in results

Depends actually, restore just restores what's there, so if only one node had a 
copy of the data then only one node had a copy of the data meaning quorum will 
still be wrong sometimes.

On Thu, Mar 16, 2017 at 1:53 PM, Arvydas Jonusonis wrote:
If the data was written at ONE, consistency is not guaranteed. ..but 
considering you just restored the cluster, there's a good chance something else 
is off.

On Thu, Mar 16, 2017 at 18:19 srinivasarao daruna wrote:
Want to make read and write QUORUM as well. 


On Mar 16, 2017 1:09 PM, "Ryan Svihla"  wrote:
        Replication factor is 3, and write consistency is ONE and read 
consistency is QUORUM.

That combination is not gonna work well:

Write succeeds to NODE A but fails on node B,C

Read goes to NODE B, C

If you can tolerate some temporary inaccuracy you can use QUORUM but may still 
have the situation where

Write succeeds on node A at timestamp 1, B succeeds at timestamp 2
Read succeeds on node B and C at timestamp 1 

If you need fully race-condition-free counts, I'm afraid you need to use SERIAL 
or LOCAL_SERIAL (for accuracy within one DC only).
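
The reason write ONE plus read QUORUM falls short at replication factor 3 can be 
checked with the usual overlap rule: a read is guaranteed to touch at least one 
replica that acknowledged the write only if R + W > RF. The Python sketch below 
encodes just that rule; the helper names are made up for the example, and it 
does not model the in-flight write race described above or SERIAL semantics.

```python
def replicas_required(level, rf):
    """Number of replicas that must respond for a given consistency level."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return rf // 2 + 1
    if level == "ALL":
        return rf
    raise ValueError(f"unsupported consistency level: {level}")


def read_overlaps_write(write_cl, read_cl, rf):
    """True if every read set must share at least one replica with every write set."""
    return replicas_required(write_cl, rf) + replicas_required(read_cl, rf) > rf


rf = 3
one_then_quorum = read_overlaps_write("ONE", "QUORUM", rf)        # 1 + 2 > 3 ?
quorum_then_quorum = read_overlaps_write("QUORUM", "QUORUM", rf)  # 2 + 2 > 3 ?
one_then_all = read_overlaps_write("ONE", "ALL", rf)              # 1 + 3 > 3 ?
```

With RF 3, ONE plus QUORUM gives 1 + 2 = 3, which is not greater than 3, so a 
read can land entirely on the two replicas that missed the write.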

On Thu, Mar 16, 2017 at 1:04 PM, srinivasarao daruna wrote:
Replication strategy is SimpleReplicationStrategy.

Snitch is: EC2 snitch, as we deployed the cluster on EC2 instances.

I was worried that CL=ALL would have higher read latency and more read 
failures. But I won't rule out trying it.

Should I switch select count (*) to select partition_key column? Would that be 
of any help?


Thank you 
Regards
Srini

On Mar 16, 2017 12:46 PM, "Arvydas Jonusonis" wrote:
What are your replication strategy and snitch settings?

Have you tried doing a read at CL=ALL? If it's an actual inconsistency issue 
(missing data), this should cause the correct results to be returned. You'll 
need to run a repair to fix the inconsistencies.

If all the data is actually there, you might have one or several nodes that 
aren't identifying the correct replicas.

Arvydas



On Thu, Mar 16, 2017 at 5:31 PM, srinivasarao daruna wrote:
Hi Team, 

We are struggling with a problem related to Cassandra counts after a backup and 
restore of the cluster. Aaron Morton suggested sending this to the user list so 
that someone on the list can help me. 

We have a REST API that talks to Cassandra, and one of our queries, which 
fetches a count, is creating problems for us.

We have done a backup and restore and copied all the data to a new cluster. We 
have run nodetool refresh on the tables and nodetool repair as well.

However, one of our key API calls is returning inconsistent results. The result 
count is 0 on the first call and the actual value on later calls. The query 
frequency is fairly high, and the failure rate has also risen considerably.

1) The count query has partition keys in it. We didn't see any read timeouts or 
other errors in the API logs.

2) This is how our session-creation code looks.

val poolingOptions = new PoolingOptions
poolingOptions
  .setCoreConnectionsPerHost(HostDistance.LOCAL, 4)
  .setMaxConnectionsPerHost(HostDistance.LOCAL, 10)
  .setCoreConnectionsPerHost(HostDistance.REMOTE, 4)
  .setMaxConnectionsPerHost(HostDistance.REMOTE, 10)

val builtCluster = clusterBuilder.withCredentials(username, password)
  .withPoolingOptions(poolingOptions)
  .build()
val cassandraSession = builtCluster.get.connect()

val preparedStatement =
  cassandraSession.prepare(statement).setConsistencyLevel(ConsistencyLevel.QUORUM)
cassandraSession.execute(preparedStatement.bind(args: _*))

Query: SELECT count(*) FROM table_name WHERE parition_column=? AND 
text_column_of_clustering_key=? AND date_column_of_clustering_key<=? AND 
date_column_of_clustering_key>=?

3) Cluster configuration:

6 machines, 3 of them seeds, running Apache Cassandra 3.9. Each machine is 
equipped with 16 cores and 64 GB RAM.

        Replication factor is 3, and write consistency is ONE and read 
consistency is QUORUM.

4) cassandra is never down on any machine

5) Using cassandra-driver-core artifact with 3.1.1 version in the api.

6) nodetool tpstats shows no read failures, and no other failures.

7) We do not see any other issues in Cassandra's system.log, just a few 
warnings like the one below.

Maximum memory usage reached (512.000MiB), cannot allocate chunk of 1.000MiB
WARN  

Re: Why does CockroachDB github website say Cassandra has no Availability on datacenter failure?

2017-02-07 Thread j.kesten
Deeper inside there is a diagram:

https://raw.githubusercontent.com/cockroachdb/cockroach/master/docs/media/sql-nosql-newsql.png

They compare themselves to NoSQL, naming Riak, HBase and Cassandra. 

Of course you CAN have a Cassandra cluster that is neither fully available 
after the loss of a DC nor consistent. 

Marketing 

Sent from my Windows 10 Phone

From: DuyHai Doan
Sent: Tuesday, 7 February 2017 11:53
To: d...@cassandra.apache.org
Cc: user@cassandra.apache.org
Subject: Re: Why does CockroachDB github website say Cassandra has 
no Availability on datacenter failure?

The link you posted doesn't say anything about Cassandra.
On 7 Feb 2017 11:41, "Kant Kodali" wrote:
Why does CockroachDB github website say Cassandra has no Availability on
datacenter failure?

https://github.com/cockroachdb/cockroach