RE: Data rebalancing algorithm

2016-01-07 Thread Alec Collier
Have a look at this:
http://www.datastax.com/dev/blog/virtual-nodes-in-cassandra-1-2

The vnodes mechanism is there to provide better scalability as new nodes are 
added/removed, by allowing a single node to own several small chunks of the 
token range.

Aside from that, the process is exactly the same as in the single-token 
(non-vnode) case: the coordinator calculates the token from the partition key 
and locates the responsible node in the same way. SSTables are stored on the 
node’s disk per Cassandra table, with no reference to vnodes at all. The term 
"virtual nodes" is a bit misleading in that sense.
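
The lookup described above can be pictured with a small Python sketch. It is 
only an illustration under stated assumptions: the Ring class, its method 
names, and the MD5-based token_for are hypothetical stand-ins (Cassandra's 
default partitioner is Murmur3, and replication is ignored here). The point is 
that a node owning several tokens does not change how a key is located.

```python
import bisect
import hashlib
import random


def token_for(partition_key: str) -> int:
    """Hash a partition key to a 64-bit token (MD5 stand-in, not Murmur3)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class Ring:
    """Hypothetical token ring in which each node owns several tokens."""

    def __init__(self):
        self.tokens = []  # all tokens in the cluster, kept sorted
        self.owner = {}   # token -> name of the node that owns it

    def add_node(self, name, num_tokens, seed):
        # Random token assignment, loosely mirroring num_tokens.
        rng = random.Random(seed)
        for _ in range(num_tokens):
            token = rng.getrandbits(64)
            bisect.insort(self.tokens, token)
            self.owner[token] = name

    def node_for(self, partition_key):
        # The owner is the node holding the first token >= the key's token,
        # wrapping around to the smallest token at the end of the ring.
        token = token_for(partition_key)
        index = bisect.bisect_left(self.tokens, token)
        return self.owner[self.tokens[index % len(self.tokens)]]


if __name__ == "__main__":
    ring = Ring()
    for i, name in enumerate(["node1", "node2", "node3"]):
        ring.add_node(name, num_tokens=8, seed=i)
    print(ring.node_for("some-partition-key"))
```

With num_tokens=1 the same code models the classic one-token-per-node ring.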

Actually, Cassandra does have a fixed number of vnodes, though it is configured 
per node rather than per cluster. It's set with the num_tokens parameter in 
cassandra.yaml.

Alec

From: Sergi Vladykin [mailto:sergi.vlady...@gmail.com]
Sent: Friday, 25 December 2015 8:31 AM
To: user@cassandra.apache.org
Subject: Re: Data rebalancing algorithm

Thanks a lot for your answers!
Paulo, I'll take a look at the classes you've suggested.
Jack, the link you've provided lacks a description of how virtual nodes are 
mapped to physical sstables/indexes on disk.
To be more exact, I have the following more detailed questions:

1. How are vnodes mapped to sstables and indexes? Is one vnode a separate part 
of the sstable, is all the data from all vnodes just mixed in one SSTable, or 
maybe something else?

2. As far as I can see, Cassandra does not have a predefined constant total 
number of vnodes for the whole cluster, right? Does that mean that on 
rebalancing some parts of the data already mapped to some vnodes will be 
remapped to new vnodes on the new node?

3. How long can rebalancing take if we have, let's say, 1TB of data on a 
single node and we are adding one more node to the cluster?

Sergi


2015-12-24 19:26 GMT+03:00 Jack Krupansky:
Read details here:
https://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_add_node_to_cluster_t.html


-- Jack Krupansky

On Thu, Dec 24, 2015 at 11:09 AM, Paulo Motta wrote:
The new node will own some parts (ranges) of the ring according to the ring 
tokens the node is responsible for. These tokens are defined from the yaml 
property initial_token (manual assignment) or num_tokens (random assignment).

During the bootstrap process, raw data from the sstable sections containing the 
ranges the new node is responsible for is transferred from the nodes that 
previously owned those ranges, so the source sstables are rebuilt on the 
joining node. After each sstable is transferred, the new node rebuilds its 
primary and secondary indexes, bloom filters, etc., and at the end of the 
bootstrap process the new sstables are added to the live data set.
See org.apache.cassandra.dht.BootStrapper.java and 
org.apache.cassandra.streaming.StreamReceiveTask of the trunk branch for more 
information.
ps: I don't recall any document with specific details, so if anyone knows of 
one, please feel free to share. If you want more theoretical information, see 
the ring membership sections of the Cassandra and/or Dynamo papers.
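
As a rough illustration of which ranges move during bootstrap, here is a 
Python sketch. It is not the logic in BootStrapper or StreamReceiveTask: the 
function ranges_to_stream and its data layout are hypothetical, replication is 
ignored (each range has exactly one owner), and the new tokens are assumed to 
differ from the existing ones. For every token the joining node takes, the 
range (predecessor, token] is handed over by the node that owned the next 
token on the ring.

```python
import bisect


def ranges_to_stream(ring, new_node, new_tokens):
    """Work out which token ranges a joining node takes over, and from whom.

    ring maps token -> owning node. Returns (transfers, updated_ring), where
    each transfer is (start_exclusive, end_inclusive, source_node). A start
    greater than the end denotes a range that wraps around the ring.
    """
    ring = dict(ring)  # work on a copy, leave the caller's ring untouched
    transfers = []
    for token in sorted(new_tokens):
        tokens = sorted(ring)
        index = bisect.bisect_left(tokens, token)
        successor = tokens[index % len(tokens)]  # current end of the range
        predecessor = tokens[index - 1]          # wraps to the last token
        source = ring[successor]
        # If the new node already owns the successor token, this range was
        # covered by an earlier transfer and nothing more has to move.
        if source != new_node:
            transfers.append((predecessor, token, source))
        ring[token] = new_node
    return transfers, ring


if __name__ == "__main__":
    existing = {0: "A", 100: "B", 200: "C"}
    moves, updated = ranges_to_stream(existing, "D", [50, 150])
    for start, end, source in moves:
        print(f"stream ({start}, {end}] from {source} to D")
```

Only the data inside the listed ranges moves; everything else stays where it 
was.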


2015-12-24 13:14 GMT-02:00 Sergi Vladykin:
Guys,
I was not able to find a detailed description of the data rebalancing 
algorithm in the docs or on Google.
I mean how Cassandra moves SSTables when a new node joins the cluster, how 
primary and secondary indexes are transferred to this new node, etc.

Can anyone provide relevant links please or just reply here?
I can read source code of course, but it would be nice if someone could answer 
right away :)

Sergi




This email, including any attachments, is confidential. If you are not the 
intended recipient, you must not disclose, distribute or use the information in 
this email in any way. If you received this email in error, please notify the 
sender immediately by return email and delete the message. Unless expressly 
stated otherwise, the information in this email should not be regarded as an 
offer to sell or as a solicitation of an offer to buy any financial product or 
service, an official confirmation of any transaction, or as an official 
statement of the entity sending this message. Neither Macquarie Group Limited, 
nor any of its subsidiaries, guarantee the integrity of any emails or attached 
files and are not responsible for any changes made to them by any other person.


RE: I am a Datastax certified Cassandra architect now :)

2015-11-22 Thread Alec Collier
Congrats Prem!

I’m planning to go for it next month… any tips?

Alec

From: Prem Yadav [mailto:ipremya...@gmail.com]
Sent: Monday, 23 November 2015 5:01 AM
To: user@cassandra.apache.org
Subject: I am a Datastax certified Cassandra architect now :)

Just letting the community know that I just passed the Cassandra architect 
certification with flying colors :).
Have to say I learnt a lot from this forum.

Thanks,
Prem



RE: Order By limitation or bug?

2015-09-03 Thread Alec Collier
You should be able to execute the following:

SELECT data FROM import_file WHERE roll = 1 AND type = 'foo' ORDER BY type, id DESC;

The ORDER BY clause has to list the clustering columns in full, in their 
declared order. By default it doesn't know that you have already filtered by 
type.

Alec Collier | Workplace Service Design
Corporate Operations Group - Technology | Macquarie Group Limited

From: Robert Wille [mailto:rwi...@fold3.com]
Sent: Friday, 4 September 2015 7:17 AM
To: user@cassandra.apache.org
Subject: Re: Order By limitation or bug?

If you only specify the partition key, and none of the clustering columns, you 
can order by in either direction:

SELECT data FROM import_file WHERE roll = 1 order by type;
SELECT data FROM import_file WHERE roll = 1 order by type DESC;

These are both valid. Seems like specifying the prefix of the clustering 
columns is just a specialization of an already-supported pattern.

Robert

On Sep 3, 2015, at 2:46 PM, DuyHai Doan <doanduy...@gmail.com> wrote:


Limitation, not a bug. The reason?

On disk, data are sorted by type first, and FOR EACH type value, the data are 
sorted by id.

So to order by id, C* would need to perform an in-memory re-ordering, and I'm 
not sure how bad that would be for performance. In any case it's currently not 
possible; maybe you should create a JIRA to ask for the limitation to be lifted.
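
The sort order described above (by type first, then by id within each type) 
can be pictured with a small Python sketch. It only models the ordering of 
rows inside one partition and says nothing about what Cassandra's query engine 
actually does or supports. The sample rows and the helper rows_for_type are 
made up, and integers stand in for timeuuid values.

```python
import bisect

# Rows of a single partition (roll = 1), kept sorted by (type, id) the way
# clustering columns order them. Integers stand in for timeuuid ids.
partition = sorted([
    ("bar", 3), ("foo", 1), ("foo", 7),
    ("bar", 5), ("foo", 4), ("zip", 2),
])


def rows_for_type(rows, type_value):
    """Return the contiguous run of rows whose first clustering column matches."""
    types = [row[0] for row in rows]
    start = bisect.bisect_left(types, type_value)
    end = bisect.bisect_right(types, type_value)
    return rows[start:end]


if __name__ == "__main__":
    foo_rows = rows_for_type(partition, "foo")
    print(foo_rows)                  # ascending by id within type 'foo'
    print(list(reversed(foo_rows)))  # the same run read backwards
```

In this model the rows for a single type form one contiguous run that is 
already ordered by id.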

On Thu, Sep 3, 2015 at 10:27 PM, Robert Wille <rwi...@fold3.com> wrote:

Given this table:

CREATE TABLE import_file (
  roll int,
  type text,
  id timeuuid,
  data text,
  PRIMARY KEY ((roll), type, id)
);

This should be possible:

SELECT data FROM import_file WHERE roll = 1 AND type = 'foo' ORDER BY id DESC;

but it results in the following error:

Bad Request: Order by currently only support the ordering of columns following 
their declared order in the PRIMARY KEY

I am ordering in the declared order in the primary key. I don't see why this 
shouldn't be able to be supported. Is this a known limitation or a bug?

In this example, I can get the results I want by omitting the ORDER BY clause 
and adding WITH CLUSTERING ORDER BY (id DESC) to the schema. However, now I can 
only get descending order. I have to choose either ascending or descending 
order. I cannot get both.

Robert





RE: Question: Coordinator in cassandra

2015-08-18 Thread Alec Collier
Hi Thouraya,

In all versions of Cassandra, each query has one coordinator. Any node in the 
cluster can be the coordinator, and this choice is made by the client/driver. 
The coordinator may or may not hold a replica of the data being requested.
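
To make the "one coordinator per query" idea concrete, here is a minimal 
Python sketch of a client that picks a coordinator round-robin. It is not the 
API of any real driver: RoundRobinClient, execute, and the replicas_for 
callback are hypothetical names, and real drivers offer other policies too.

```python
import itertools


class RoundRobinClient:
    """Hypothetical client that picks a new coordinator for every query."""

    def __init__(self, nodes, replicas_for):
        self._cycle = itertools.cycle(nodes)
        self._replicas_for = replicas_for  # partition key -> set of replicas

    def execute(self, partition_key):
        # The client, not the cluster, decides which node coordinates.
        coordinator = next(self._cycle)
        is_replica = coordinator in self._replicas_for(partition_key)
        return coordinator, is_replica


if __name__ == "__main__":
    client = RoundRobinClient(["n1", "n2", "n3"], lambda key: {"n2"})
    for _ in range(3):
        print(client.execute("some-key"))
```

Each call uses exactly one coordinator, which may or may not be one of the 
replicas for the key.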

Cheers,

Alec Collier | Workplace Service Design
Corporate Operations Group - Technology | Macquarie Group Limited

From: Thouraya TH [mailto:thouray...@gmail.com]
Sent: Wednesday, 19 August 2015 2:48 AM
To: user@cassandra.apache.org
Subject: Question: Coordinator in cassandra

Hi all;

With Cassandra version 2.0.6, do we have by default one coordinator per data 
center, per client, or per query?

Thank you so much for your help.
Kind Regards.



RE: Schema questions for data structures with recently-modified access patterns

2015-07-22 Thread Alec Collier
I believe what he really wants is to be able to search for the x most recently 
modified documents, i.e. without specifying the docID.

I don’t believe there is a ‘nice’ way of doing this in Cassandra by itself, 
given it really favours key-value storage. Even having the date as the 
partition key is usually not recommended because it means all writes on a given 
date will be hitting one node.
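
The hot-spot effect can be seen in a small Python sketch. It is a 
simplification under stated assumptions: node_for and write_distribution are 
hypothetical helpers, keys are placed by MD5 hash modulo the node count rather 
than on a real token ring, and replication is ignored.

```python
import hashlib
from collections import Counter


def node_for(partition_key, nodes):
    """Pick a node from the hash of the partition key (simplified placement)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


def write_distribution(partition_keys, nodes):
    """Count how many writes land on each node."""
    return Counter(node_for(key, nodes) for key in partition_keys)


if __name__ == "__main__":
    nodes = ["n1", "n2", "n3", "n4"]
    # Every write on the same day shares one partition key.
    print(write_distribution(["2015-07-22"] * 1000, nodes))
    # Writes keyed by document id spread across the cluster.
    print(write_distribution([f"doc-{i}" for i in range(1000)], nodes))
```

With the date as the partition key, all of that day's writes map to a single 
node, whereas distinct document ids spread across several nodes.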

Perhaps Solr integration is the way to go for this access pattern?

Alec Collier

From: Jack Krupansky [mailto:jack.krupan...@gmail.com]
Sent: Thursday, 23 July 2015 8:20 AM
To: user@cassandra.apache.org
Subject: Re: Schema questions for data structures with recently-modified access 
patterns

> No way to query recently-modified documents.

I don't follow why you say that. I mean, that was the point of the data model 
suggestion I proposed. Maybe you could clarify.

I also wanted to mention that the new materialized view feature of Cassandra 
3.0 might handle this use case, including taking care of the delete, 
automatically.


-- Jack Krupansky

On Tue, Jul 21, 2015 at 12:37 PM, Robert Wille <rwi...@fold3.com> wrote:
The time series doesn’t provide the access pattern I’m looking for. No way to 
query recently-modified documents.

On Jul 21, 2015, at 9:13 AM, Carlos Alonso <i...@mrcalonso.com> wrote:


Hi Robert,

What about modelling it as a time series?

CREATE TABLE document (
  docId UUID,
  doc TEXT,
  last_modified TIMESTAMP,
  PRIMARY KEY (docId, last_modified)
) WITH CLUSTERING ORDER BY (last_modified DESC);

This way, the latest modification will always be the first record in the 
row, so accessing it should be as easy as:

SELECT * FROM document WHERE docId = <the docId> LIMIT 1;

And if you experience disk space issues due to very long rows, you can always 
expire old ones using TTL or in a batch job. Tombstones will never be a 
problem in this case because, due to the specified clustering order, the latest 
modification will always be the first record in the row.

Hope it helps.

Carlos Alonso | Software Engineer | @calonso (https://twitter.com/calonso)

On 21 July 2015 at 05:59, Robert Wille <rwi...@fold3.com> wrote:
Data structures that have a recently-modified access pattern seem to be a poor 
fit for Cassandra. I’m wondering if any of you smart guys can provide 
suggestions.

For the sake of discussion, let's assume I have the following tables:

CREATE TABLE document (
  docId UUID,
  doc TEXT,
  last_modified TIMEUUID,
  PRIMARY KEY ((docId))
);

CREATE TABLE doc_by_last_modified (
  date TEXT,
  last_modified TIMEUUID,
  docId UUID,
  PRIMARY KEY ((date), last_modified)
);
When I update a document, I retrieve its last_modified time, delete the current 
record from doc_by_last_modified, and add a new one. Unfortunately, if you’d 
like each document to appear at most once in the doc_by_last_modified table, 
then this doesn’t work so well.

Documents can get into the doc_by_last_modified table multiple times if there 
is concurrent access, or if there is a consistency issue.
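
The concurrency problem can be reproduced with a small in-memory Python 
sketch. It is only a model of the read, delete, insert sequence described 
above: the class RecentlyModifiedIndex and the function simulate_race are 
hypothetical, plain integers stand in for timeuuid values, and the date 
partition is left out.

```python
class RecentlyModifiedIndex:
    """In-memory stand-in for the document and doc_by_last_modified tables."""

    def __init__(self):
        self.document = {}                 # docId -> last_modified
        self.doc_by_last_modified = set()  # (last_modified, docId) entries

    def read_last_modified(self, doc_id):
        return self.document.get(doc_id)

    def write_update(self, doc_id, previous, new):
        # Delete the index entry the writer believes is current, then add
        # the new one. Deleting a missing entry is silently a no-op.
        if previous is not None:
            self.doc_by_last_modified.discard((previous, doc_id))
        self.doc_by_last_modified.add((new, doc_id))
        self.document[doc_id] = new

    def entries_for(self, doc_id):
        return sorted(ts for ts, d in self.doc_by_last_modified if d == doc_id)


def simulate_race():
    """Two writers read the same last_modified before either one writes."""
    index = RecentlyModifiedIndex()
    index.write_update("doc-1", None, 1)
    seen_by_a = index.read_last_modified("doc-1")
    seen_by_b = index.read_last_modified("doc-1")
    index.write_update("doc-1", seen_by_a, 2)
    index.write_update("doc-1", seen_by_b, 3)  # deletes entry 1 again, not 2
    return index.entries_for("doc-1")


if __name__ == "__main__":
    print(simulate_race())
```

Because both writers delete the entry they read rather than the one that is 
actually current, the document ends up in the index more than once.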

Any thoughts out there on how to efficiently provide recently-modified access 
to a table? This problem exists for many types of data structures, not just 
recently-modified. Any ordered data structure that can be dynamically reordered 
suffers from the same problems. As I’ve been doing schema design, this pattern 
keeps recurring. A nice way to address this problem has lots of applications.

Thanks in advance for your thoughts

Robert



