RE: Question about replica and replication factor

Jun Wu Tue, 20 Sep 2016 08:46:44 -0700

Great explanation!
For the single partition read, it makes sense to read data from only one 
replica. 
Thank you so much Ben!
Jun

From: [email protected]
Date: Tue, 20 Sep 2016 05:30:43 +0000
Subject: Re: Question about replica and replication factor
To: [email protected]
CC: [email protected]

“replica” here means “a node that has a copy of the data for a given 
partition”. The scenario being discussed here is CL > 1. In this case, rather 
than using up network and processing capacity sending all the data from all the 
nodes required to meet the consistency level, Cassandra gets the full data from 
one replica and checksums from the others. Only if the checksums don’t match 
the full data does Cassandra need to get full data from all the relevant 
replicas.
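The checksum comparison described above can be sketched roughly as follows. 
This is a toy model, not Cassandra's code: the function names, the use of MD5, 
the in-memory dict "replicas", and resolving a mismatch by newest timestamp are 
all assumptions made for illustration.

```python
import hashlib

def digest(value: bytes) -> str:
    # Stand-in checksum; MD5 is chosen here purely for illustration.
    return hashlib.md5(value).hexdigest()

def read_with_digests(replicas, key, consistency_level):
    """Toy read path. Each replica is a dict: key -> (timestamp, value bytes)."""
    # Contact only as many replicas as the consistency level requires.
    contacted = replicas[:consistency_level]
    data_replica, digest_replicas = contacted[0], contacted[1:]

    # Full data comes from one replica; the others only supply checksums.
    _, value = data_replica[key]
    expected = digest(value)
    if all(digest(r[key][1]) == expected for r in digest_replicas):
        return value, "digests matched"

    # On a mismatch, fetch full data from every contacted replica and keep
    # the version with the newest timestamp (assumed resolution rule).
    newest = max((r[key] for r in contacted), key=lambda tv: tv[0])
    return newest[1], "digest mismatch, full read"
```

For example, with one stale replica `{"k": (1, b"old")}` and one current 
replica `{"k": (2, b"new")}` read at CL = 2, the digests disagree and the 
sketch falls back to a full read that returns `b"new"`.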
I think the other point here is, conceptually, you should think of the 
coordinator as splitting up any query that hits multiple partitions into a set 
of queries, one per partition (there might be some optimisations that make this 
not quite physically correct but conceptually it’s about right). Discussions 
such as the one you quote above tend to consider a single-partition read 
(which is the most common kind of read in most uses of Cassandra).
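As a conceptual sketch of that splitting (not how the coordinator is actually 
implemented), a multi-partition query can be viewed as a set of per-partition 
reads grouped by the node picked to serve each one. `plan_reads` and the 
`owner_of` lookup are hypothetical names introduced only for this example.

```python
from collections import defaultdict

def plan_reads(partition_keys, owner_of):
    """Group single-partition reads by the node chosen to serve each key.

    owner_of is a hypothetical callable mapping a partition key to a node name.
    """
    plan = defaultdict(list)
    for key in partition_keys:
        plan[owner_of(key)].append(key)
    return dict(plan)

# Hypothetical placement of three partitions on two nodes.
owners = {"a": "node1", "b": "node2", "c": "node1"}
print(plan_reads(["a", "b", "c"], owners.get))
```

Each entry in the resulting plan is a list of independent single-partition 
reads, which is the case the quoted discussions have in mind.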
Cheers
Ben
On Tue, 20 Sep 2016 at 15:18 Jun Wu <[email protected]> wrote:

Yes, I think for my case, at least two nodes need to be contacted to get the 
full set of data.
But another thing comes up about the dynamic snitch. It wraps the configured 
snitch, is enabled by default, and chooses the fastest/closest node to read 
data from. Another post is about this: 
http://www.datastax.com/dev/blog/dynamic-snitching-in-cassandra-past-present-and-future

What I don't understand is why it still emphasizes reading data from only one 
replica. Below is from the post:
To begin, let’s first answer the most obvious question: what is dynamic 
snitching? To understand this, we’ll first recall what a snitch does. A 
snitch’s function is to determine which datacenters and racks are both written 
to and read from. So, why would that be ‘dynamic?’ This comes into play on the 
read side only (there’s nothing to be done for writes since we send them all 
and then block until the consistency level is achieved.) When doing reads 
however, Cassandra only asks one node for the actual data, and, depending on 
consistency level and read repair chance, it asks the remaining replicas for 
checksums only. This means that it has a choice of however many replicas exist 
to ask for the actual data, and this is where the dynamic snitch goes to 
work. Since only one replica is sending the full data we need, we need to choose 
the best possible replica to ask, since if all we get back is checksums we have 
nothing useful to return to the user. The dynamic snitch handles this task by 
monitoring the performance of reads from the various replicas and choosing the 
best one based on this history.
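The idea of choosing the data replica from read history can be sketched as 
below. This is a deliberately simplified model: the real dynamic snitch uses a 
more elaborate scoring scheme, and `LatencyTracker`, its window size, and the 
plain average are assumptions made for this example.

```python
from collections import defaultdict, deque

class LatencyTracker:
    """Keeps a sliding window of read latencies per replica (toy model)."""

    def __init__(self, window=100):
        self.samples = defaultdict(lambda: deque(maxlen=window))

    def record(self, replica, latency_ms):
        self.samples[replica].append(latency_ms)

    def score(self, replica):
        history = self.samples.get(replica)
        if not history:
            return 0.0  # no history yet: score it best so it gets tried
        return sum(history) / len(history)

    def best(self, replicas):
        # The replica with the lowest average recent latency serves the full
        # data; the remaining replicas would only be asked for checksums.
        return min(replicas, key=self.score)
```

Because the window is bounded, a replica that was slow in the past can become 
the preferred one again once its recent reads are fast.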
Sent from my iPad
On Sep 20, 2016, at 00:03, Ben Slater 
<[email protected]> wrote:

If your read operation requires data from multiple partitions and the 
partitions are spread across multiple nodes then the coordinator has the job of 
contacting the multiple nodes to get the data and return it to the client. So, 
in your scenario, if you did a select * from table (with no where clause) the 
coordinator would need to contact and execute a read on at least one other node 
to satisfy the query.
Cheers
Ben
On Tue, 20 Sep 2016 at 14:50 Jun Wu <[email protected]> wrote:

Hi Ben,
    Thanks for the quick response. 
    The example for a single row/partition is clear. However, data normally 
spans more than a single row, and for that case I'm still confused. 
http://docs.datastax.com/en/cassandra/2.1/cassandra/dml/architectureClientRequestsRead_c.html
    The link above gives an example of a 10-node cluster with RF = 3. But the 
figure and the text on that page show that the coordinator only contacts/reads 
data from one replica, and performs read repair on the remaining replicas. 
    Also, how could a read go across all nodes in the cluster? 
    Thanks!
Jun

From: [email protected]
Date: Tue, 20 Sep 2016 04:18:59 +0000
Subject: Re: Question about replica and replication factor
To: [email protected]

Each individual read (where a read is a single row or single partition) will 
read from one node (ignoring read repairs) as each partition will be contained 
entirely on a single node. To read the full set of data,  reads would hit at 
least two nodes (in practice, reads would likely end up being distributed 
across all the nodes in your cluster).
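The arithmetic behind "50% per node" and "at least two nodes" can be checked 
with a short sketch. It assumes data is spread evenly across nodes, which real 
clusters only approximate; the function names are made up for this example.

```python
def fraction_per_node(nodes, rf):
    # With RF copies spread evenly over the cluster, each node holds
    # RF / N of the data (capped at 100% when RF >= N).
    return min(rf / nodes, 1.0)

def min_nodes_for_all_data(nodes, rf):
    # Lower bound: if each node holds RF / N of the data, at least
    # ceil(N / RF) nodes are needed to cover everything.
    rf = min(rf, nodes)
    return (nodes + rf - 1) // rf

print(fraction_per_node(6, 3))       # 0.5
print(min_nodes_for_all_data(6, 3))  # 2
```

For the 6-node, RF = 3 cluster in this thread that gives 50% of the data per 
node and a minimum of 2 nodes to cover the full data set, even though any 
single-partition read at CL = 1 is served by one node.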
Cheers
Ben
On Tue, 20 Sep 2016 at 14:09 Jun Wu <[email protected]> wrote:

Hi there,
    I have a question about replicas and the replication factor. 
    For example, I have a cluster of 6 nodes in the same data center. 
The replication factor (RF) is set to 3 and the consistency level is the 
default, 1. According to this calculator, 
http://www.ecyrd.com/cassandracalculator/, every node will store 50% of the 
data.
    When I want to read all the data from the cluster, how many nodes should I 
read from, 2 or 1? Is it 2, because each node has half the data? But the 
calculator shows 1: "You are really reading from 1 node every time."
    Any suggestions? Thanks!
Jun
-- 
Ben Slater
Chief Product Officer
Instaclustr: Cassandra + Spark - Managed | Consulting | Support
+61 437 929 798
