Re: Replicate On Write behavior

2011-09-09 Thread David Hawthorne
They are evenly distributed.  5 nodes * 40 connections each using hector, and I 
can confirm that all 200 are active when this happened (from hector's 
perspective, from graphing the hector jmx data), and all 5 nodes saw roughly 40 
connections, and all were receiving traffic over those connections.  (netstat + 
ntop + trafshow, etc)  I can also confirm that I changed my insert strategy to 
break up the rows using composite row keys, which reduced the row lengths and 
gave me an almost perfectly even data distribution among the nodes, and that 
was when I started to really dig into why the ReplicateOnWrite (ROW) tasks were 
still backing up on one node specifically, and why 2 nodes weren't seeing any.  
It was the 20%, 20%, 
60% ROW distribution that really got me thinking, and when I took the 60% node 
out of the cluster, that ROW load jumped back to the node with the next-lowest 
IP address, and the 2 nodes that weren't seeing any *still* weren't seeing any 
ROWs.  At that point I tore down the cluster, recreated it as a 3 node cluster 
several times using various permutations of the 5 nodes available, and ROW load 
was *always* on the node with the lowest IP address.

The theory might not be right, but it certainly represents the behavior I saw.


On Sep 9, 2011, at 12:17 AM, Sylvain Lebresne wrote:

 We'll solve #2890 and we should have done it sooner.
 
 That being said, a quick question: how do you do your inserts from the
 clients? Are you evenly distributing the inserts among the nodes, or are
 you always hitting the same coordinator?
 
 Because provided the nodes are correctly distributed on the ring, if you
 distribute the insert (increment) requests across the nodes (again, I'm
 talking about client requests), you should not see the behavior you observe.
 
 --
 Sylvain
 
 On Thu, Sep 8, 2011 at 9:48 PM, David Hawthorne dha...@gmx.3crowd.com wrote:
 It was exactly due to 2890, and the fact that the first replica is always 
 the one with the lowest value IP address.  I patched cassandra to pick a 
 random node out of the replica set in StorageProxy.java findSuitableEndpoint:
 
 Random rng = new Random();
 
 return endpoints.get(rng.nextInt(endpoints.size()));  // instead of: return endpoints.get(0);
 
 Now the workload is evenly balanced among all 5 nodes and I'm getting 2.5x the 
 inserts/sec throughput.
 
 Here's the behavior I saw ("disk work" refers to the ReplicateOnWrite
 load of a counter insert):
 
 One node will get RF/n of the disk work.  Two nodes will always get 0 disk 
 work.
 
 In a 3-node cluster, 1 node gets hit really hard on disk, and you get the 
 performance of a one-node cluster.
 In a 6-node cluster, 1 node gets hit with 50% of the disk work, giving you 
 the performance of a ~2-node cluster.
 In a 10-node cluster, 1 node gets 30% of the disk work, giving you the 
 performance of a ~3-node cluster.
 
 I confirmed this behavior with a 3, 4, and 5 node cluster size.
 
 
 
 On another note, on a 5-node cluster, I'm only seeing 3 nodes with 
 ReplicateOnWrite Completed tasks in nodetool tpstats output.  Is that 
 normal?  I'm using RandomPartitioner...
 
 Address         DC          Rack        Status State   Load            Owns    Token
                                                                                 136112946768375385385349842972707284580
 10.0.0.57       datacenter1 rack1       Up     Normal  2.26 GB         20.00%  0
 10.0.0.56       datacenter1 rack1       Up     Normal  2.47 GB         20.00%  34028236692093846346337460743176821145
 10.0.0.55       datacenter1 rack1       Up     Normal  2.52 GB         20.00%  68056473384187692692674921486353642290
 10.0.0.54       datacenter1 rack1       Up     Normal  950.97 MB       20.00%  102084710076281539039012382229530463435
 10.0.0.72       datacenter1 rack1       Up     Normal  383.25 MB       20.00%  136112946768375385385349842972707284580
 
 The nodes with ReplicateOnWrites are the 3 in the middle.  The first node 
 and last node both have a count of 0.  This is a clean cluster, and I've 
 been doing 3k ... 2.5k (decaying performance) inserts/sec for the last 12 
 hours.  The last time this test ran, it went all the way down to 500 
 inserts/sec before I killed it.
 
 Could be due to https://issues.apache.org/jira//browse/CASSANDRA-2890.
 
 --
 Sylvain
 
 



Re: Replicate On Write behavior

2011-09-08 Thread David Hawthorne
It was exactly due to 2890, and the fact that the first replica is always the 
one with the lowest value IP address.  I patched cassandra to pick a random 
node out of the replica set in StorageProxy.java findSuitableEndpoint:

Random rng = new Random();

return endpoints.get(rng.nextInt(endpoints.size()));  // instead of: return endpoints.get(0);

Now the workload is evenly balanced among all 5 nodes and I'm getting 2.5x the 
inserts/sec throughput.
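
For anyone who wants to try this without digging through StorageProxy, here's a 
stripped-down, self-contained sketch of the idea (the class and method names 
below are stand-ins for illustration, not the actual Cassandra code; the real 
change is just the one-liner above in findSuitableEndpoint):

import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Schematic only: pick the ReplicateOnWrite target from the whole replica set
// instead of always taking the first (lowest-IP) replica.
public class ReplicaPickerSketch
{
    private static final Random RNG = new Random();

    // Stock behavior (the CASSANDRA-2890 hot spot): always the first replica.
    static String firstReplica(List<String> endpoints)
    {
        return endpoints.get(0);
    }

    // Patched behavior: spread the load across all replicas of the key.
    static String randomReplica(List<String> endpoints)
    {
        return endpoints.get(RNG.nextInt(endpoints.size()));
    }

    public static void main(String[] args)
    {
        List<String> replicas = Arrays.asList("10.0.0.55", "10.0.0.56", "10.0.0.57");
        System.out.println("stock:   " + firstReplica(replicas));
        System.out.println("patched: " + randomReplica(replicas));
    }
}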

Here's the behavior I saw ("disk work" refers to the ReplicateOnWrite load 
of a counter insert):

One node will get RF/n of the disk work.  Two nodes will always get 0 disk work.

In a 3-node cluster, 1 node gets hit really hard on disk, and you get the 
performance of a one-node cluster.
In a 6-node cluster, 1 node gets hit with 50% of the disk work, giving you the 
performance of a ~2-node cluster.
In a 10-node cluster, 1 node gets 30% of the disk work, giving you the 
performance of a ~3-node cluster.

I confirmed this behavior with a 3, 4, and 5 node cluster size.
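
To put rough numbers on it (this is just my back-of-the-envelope model, taking 
the RF/n share above at face value): if a single node absorbs that fraction of 
the disk work, it becomes the bottleneck and the cluster behaves like roughly 
n/RF nodes.

// Back-of-the-envelope model for the effective cluster size when one node
// takes an RF/n share of the ReplicateOnWrite disk work (RF=3 here).
public class RowBottleneckModel
{
    public static void main(String[] args)
    {
        int rf = 3;
        for (int n : new int[] { 3, 6, 10 })
        {
            double hotNodeShare = (double) rf / n;       // fraction of disk work on the hot node
            double effectiveNodes = 1.0 / hotNodeShare;  // roughly n / RF
            System.out.printf("n=%d: hot node does %.0f%%, cluster behaves like ~%.1f nodes%n",
                              n, hotNodeShare * 100, effectiveNodes);
        }
    }
}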


 
 On another note, on a 5-node cluster, I'm only seeing 3 nodes with 
 ReplicateOnWrite Completed tasks in nodetool tpstats output.  Is that 
 normal?  I'm using RandomPartitioner...
 
 Address         DC          Rack        Status State   Load            Owns    Token
                                                                                 136112946768375385385349842972707284580
 10.0.0.57       datacenter1 rack1       Up     Normal  2.26 GB         20.00%  0
 10.0.0.56       datacenter1 rack1       Up     Normal  2.47 GB         20.00%  34028236692093846346337460743176821145
 10.0.0.55       datacenter1 rack1       Up     Normal  2.52 GB         20.00%  68056473384187692692674921486353642290
 10.0.0.54       datacenter1 rack1       Up     Normal  950.97 MB       20.00%  102084710076281539039012382229530463435
 10.0.0.72       datacenter1 rack1       Up     Normal  383.25 MB       20.00%  136112946768375385385349842972707284580
 
 The nodes with ReplicateOnWrites are the 3 in the middle.  The first node 
 and last node both have a count of 0.  This is a clean cluster, and I've 
 been doing 3k ... 2.5k (decaying performance) inserts/sec for the last 12 
 hours.  The last time this test ran, it went all the way down to 500 
 inserts/sec before I killed it.
 
 Could be due to https://issues.apache.org/jira//browse/CASSANDRA-2890.
 
 --
 Sylvain



Re: Replicate On Write behavior

2011-09-02 Thread Sylvain Lebresne
On Thu, Sep 1, 2011 at 8:52 PM, David Hawthorne dha...@gmx.3crowd.com wrote:
 I'm curious... digging through the source, it looks like replicate on write 
 triggers a read of the entire row, and not just the columns/supercolumns that 
 are affected by the counter update.  Is this the case?  It would certainly 
 explain why my inserts/sec decay over time and why the average insert latency 
 increases over time.  The strange thing is that I'm not seeing disk read IO 
 increase over that same period, but that might be due to the OS buffer 
 cache...

It does not. It only reads the columns/supercolumns affected by the
counter update.
In the source, this happens in CounterMutation.java. If you look at
addReadCommandFromColumnFamily you'll see that it does a query by name
only for the column involved in the update (the update is basically
the content of the columnFamily parameter there).

And Cassandra does *not* always read a full row. Never has, never will.
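
To make the difference concrete, a purely schematic sketch (the class and method 
names below are hypothetical, not the actual code; in the real code the by-name 
read command is built in CounterMutation, as described above):

import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only -- these are not Cassandra classes.  The point
// is just that the ReplicateOnWrite read names exactly the columns touched by
// the counter update; it does not slice the whole row.
class ByNameReadSketch
{
    static Set<String> columnsToRead(Set<String> columnsInUpdate)
    {
        // Query by name: one entry per updated column, nothing else from the row.
        return new TreeSet<>(columnsInUpdate);
    }

    public static void main(String[] args)
    {
        Set<String> update = new TreeSet<>(Arrays.asList("clicks", "pageviews"));
        System.out.println("by-name read for: " + columnsToRead(update));
    }
}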

 On another note, on a 5-node cluster, I'm only seeing 3 nodes with 
 ReplicateOnWrite Completed tasks in nodetool tpstats output.  Is that normal? 
  I'm using RandomPartitioner...

 Address         DC          Rack        Status State   Load            Owns   
  Token
                                                                            
 136112946768375385385349842972707284580
 10.0.0.57    datacenter1 rack1       Up     Normal  2.26 GB         20.00%  0
 10.0.0.56    datacenter1 rack1       Up     Normal  2.47 GB         20.00%  
 34028236692093846346337460743176821145
 10.0.0.55    datacenter1 rack1       Up     Normal  2.52 GB         20.00%  
 68056473384187692692674921486353642290
 10.0.0.54    datacenter1 rack1       Up     Normal  950.97 MB       20.00%  
 102084710076281539039012382229530463435
 10.0.0.72    datacenter1 rack1       Up     Normal  383.25 MB       20.00%  
 136112946768375385385349842972707284580

 The nodes with ReplicateOnWrites are the 3 in the middle.  The first node and 
 last node both have a count of 0.  This is a clean cluster, and I've been 
 doing 3k ... 2.5k (decaying performance) inserts/sec for the last 12 hours.  
 The last time this test ran, it went all the way down to 500 inserts/sec 
 before I killed it.

Could be due to https://issues.apache.org/jira//browse/CASSANDRA-2890.

--
Sylvain


Re: Replicate On Write behavior

2011-09-02 Thread David Hawthorne
That's interesting.  I did an experiment wherein I added some entropy to the 
row name based on the time when the increment came in (e.g. row = row + "/" + 
(timestamp - (timestamp % 300))), and now not only is the load (in GB) on my 
cluster more balanced, but the performance has also not decayed and has stayed 
steady (inserts/sec) with a relatively low average ms/insert.  Each row is now 
significantly shorter as a result of this change.
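
In case it helps anyone else, here's a minimal, self-contained sketch of the key 
bucketing described above (the class, method, and example key names are made up 
for illustration; the 300-second bucket width matches what I used):

// Minimal sketch: append the start of the current 5-minute window to the row
// key so counters for one logical row spread across many shorter physical rows.
public class BucketedRowKey
{
    static String bucketedKey(String baseRow, long epochSeconds, long bucketSeconds)
    {
        long bucketStart = epochSeconds - (epochSeconds % bucketSeconds);
        return baseRow + "/" + bucketStart;
    }

    public static void main(String[] args)
    {
        long now = System.currentTimeMillis() / 1000L;
        // e.g. "site42/hits/1315526400" -- each 300s window gets its own row.
        System.out.println(bucketedKey("site42/hits", now, 300));
    }
}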



On Sep 2, 2011, at 12:30 AM, Sylvain Lebresne wrote:

 On Thu, Sep 1, 2011 at 8:52 PM, David Hawthorne dha...@gmx.3crowd.com wrote:
 I'm curious... digging through the source, it looks like replicate on write 
 triggers a read of the entire row, and not just the columns/supercolumns 
 that are affected by the counter update.  Is this the case?  It would 
 certainly explain why my inserts/sec decay over time and why the average 
 insert latency increases over time.  The strange thing is that I'm not 
 seeing disk read IO increase over that same period, but that might be due to 
 the OS buffer cache...
 
 It does not. It only reads the columns/supercolumns affected by the
 counter update.
 In the source, this happens in CounterMutation.java. If you look at
 addReadCommandFromColumnFamily you'll see that it does a query by name
 only for the column involved in the update (the update is basically
 the content of the columnFamily parameter there).
 
 And Cassandra does *not* always read a full row. Never has, never will.
 
 On another note, on a 5-node cluster, I'm only seeing 3 nodes with 
 ReplicateOnWrite Completed tasks in nodetool tpstats output.  Is that 
 normal?  I'm using RandomPartitioner...
 
 Address         DC          Rack        Status State   Load            Owns    Token
                                                                                 136112946768375385385349842972707284580
 10.0.0.57       datacenter1 rack1       Up     Normal  2.26 GB         20.00%  0
 10.0.0.56       datacenter1 rack1       Up     Normal  2.47 GB         20.00%  34028236692093846346337460743176821145
 10.0.0.55       datacenter1 rack1       Up     Normal  2.52 GB         20.00%  68056473384187692692674921486353642290
 10.0.0.54       datacenter1 rack1       Up     Normal  950.97 MB       20.00%  102084710076281539039012382229530463435
 10.0.0.72       datacenter1 rack1       Up     Normal  383.25 MB       20.00%  136112946768375385385349842972707284580
 
 The nodes with ReplicateOnWrites are the 3 in the middle.  The first node 
 and last node both have a count of 0.  This is a clean cluster, and I've 
 been doing 3k ... 2.5k (decaying performance) inserts/sec for the last 12 
 hours.  The last time this test ran, it went all the way down to 500 
 inserts/sec before I killed it.
 
 Could be due to https://issues.apache.org/jira//browse/CASSANDRA-2890.
 
 --
 Sylvain



Re: Replicate On Write behavior

2011-09-02 Thread Ian Danforth
That ticket explains a lot, looking forward to a resolution on it.
(Sorry I don't have a patch to offer)

Ian

On Fri, Sep 2, 2011 at 12:30 AM, Sylvain Lebresne sylv...@datastax.com wrote:
 On Thu, Sep 1, 2011 at 8:52 PM, David Hawthorne dha...@gmx.3crowd.com wrote:
 I'm curious... digging through the source, it looks like replicate on write 
 triggers a read of the entire row, and not just the columns/supercolumns 
 that are affected by the counter update.  Is this the case?  It would 
 certainly explain why my inserts/sec decay over time and why the average 
 insert latency increases over time.  The strange thing is that I'm not 
 seeing disk read IO increase over that same period, but that might be due to 
 the OS buffer cache...

 It does not. It only reads the columns/supercolumns affected by the
 counter update.
 In the source, this happens in CounterMutation.java. If you look at
 addReadCommandFromColumnFamily you'll see that it does a query by name
 only for the column involved in the update (the update is basically
 the content of the columnFamily parameter there).

 And Cassandra does *not* always read a full row. Never has, never will.

 On another note, on a 5-node cluster, I'm only seeing 3 nodes with 
 ReplicateOnWrite Completed tasks in nodetool tpstats output.  Is that 
 normal?  I'm using RandomPartitioner...

 Address         DC          Rack        Status State   Load            Owns  
   Token
                                                                            
 136112946768375385385349842972707284580
 10.0.0.57    datacenter1 rack1       Up     Normal  2.26 GB         20.00%  0
 10.0.0.56    datacenter1 rack1       Up     Normal  2.47 GB         20.00%  
 34028236692093846346337460743176821145
 10.0.0.55    datacenter1 rack1       Up     Normal  2.52 GB         20.00%  
 68056473384187692692674921486353642290
 10.0.0.54    datacenter1 rack1       Up     Normal  950.97 MB       20.00%  
 102084710076281539039012382229530463435
 10.0.0.72    datacenter1 rack1       Up     Normal  383.25 MB       20.00%  
 136112946768375385385349842972707284580

 The nodes with ReplicateOnWrites are the 3 in the middle.  The first node 
 and last node both have a count of 0.  This is a clean cluster, and I've 
 been doing 3k ... 2.5k (decaying performance) inserts/sec for the last 12 
 hours.  The last time this test ran, it went all the way down to 500 
 inserts/sec before I killed it.

 Could be due to https://issues.apache.org/jira//browse/CASSANDRA-2890.

 --
 Sylvain



Re: Replicate On Write behavior

2011-09-02 Thread David Hawthorne
Does it always pick the node with the lowest IP address?  All of my hosts are 
in the same /24.  The fourth node in the 5 node cluster has the lowest value in 
the 4th octet (54).  I erased the cluster and rebuilt it from scratch as a 3 
node cluster using the first 3 nodes, and now the ReplicateOnWrites are all 
going to the third node, which is also the lowest valued IP address (55).

That would explain why only 1 node gets writes in a 3 node cluster (RF=3) and 
why 3 nodes get writes in a 5 node cluster, and why one of those 3 is taking 
66% of the writes.
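
For what it's worth, here's a toy demonstration of why a lowest-IP bias would 
fall straight out of any ordering that reduces to comparing raw address bytes 
and then taking element 0 (this is just my reading of the behavior, not the 
actual snitch/StorageProxy code):

import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Toy illustration: order the replica set by raw address bytes and always take
// element 0 -- the lowest IP in the set wins every time.
public class LowestIpWins
{
    public static void main(String[] args) throws UnknownHostException
    {
        List<InetAddress> replicas = Arrays.asList(
                InetAddress.getByName("10.0.0.57"),
                InetAddress.getByName("10.0.0.56"),
                InetAddress.getByName("10.0.0.55"));

        replicas.sort(Comparator.comparing(a -> new java.math.BigInteger(1, a.getAddress())));

        // Prints 10.0.0.55 for this set, matching the ROW hot spot I saw.
        System.out.println("first replica: " + replicas.get(0).getHostAddress());
    }
}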


 
 On another note, on a 5-node cluster, I'm only seeing 3 nodes with 
 ReplicateOnWrite Completed tasks in nodetool tpstats output.  Is that 
 normal?  I'm using RandomPartitioner...
 
 Address         DC          Rack        Status State   Load            Owns    Token
                                                                                 136112946768375385385349842972707284580
 10.0.0.57       datacenter1 rack1       Up     Normal  2.26 GB         20.00%  0
 10.0.0.56       datacenter1 rack1       Up     Normal  2.47 GB         20.00%  34028236692093846346337460743176821145
 10.0.0.55       datacenter1 rack1       Up     Normal  2.52 GB         20.00%  68056473384187692692674921486353642290
 10.0.0.54       datacenter1 rack1       Up     Normal  950.97 MB       20.00%  102084710076281539039012382229530463435
 10.0.0.72       datacenter1 rack1       Up     Normal  383.25 MB       20.00%  136112946768375385385349842972707284580
 
 The nodes with ReplicateOnWrites are the 3 in the middle.  The first node 
 and last node both have a count of 0.  This is a clean cluster, and I've 
 been doing 3k ... 2.5k (decaying performance) inserts/sec for the last 12 
 hours.  The last time this test ran, it went all the way down to 500 
 inserts/sec before I killed it.
 
 Could be due to https://issues.apache.org/jira//browse/CASSANDRA-2890.
 
 --
 Sylvain



Re: Replicate On Write behavior

2011-09-01 Thread Yang
When Cassandra reads, the entire CF is always read together; only at the
hand-over to the client does the pruning happen.

On Thu, Sep 1, 2011 at 11:52 AM, David Hawthorne dha...@gmx.3crowd.com wrote:

 I'm curious... digging through the source, it looks like replicate on write
 triggers a read of the entire row, and not just the columns/supercolumns
 that are affected by the counter update.  Is this the case?  It would
 certainly explain why my inserts/sec decay over time and why the average
 insert latency increases over time.  The strange thing is that I'm not
 seeing disk read IO increase over that same period, but that might be due to
 the OS buffer cache...

 On another note, on a 5-node cluster, I'm only seeing 3 nodes with
 ReplicateOnWrite Completed tasks in nodetool tpstats output.  Is that
 normal?  I'm using RandomPartitioner...

 Address         DC          Rack        Status State   Load            Owns    Token
                                                                                 136112946768375385385349842972707284580
 10.0.0.57       datacenter1 rack1       Up     Normal  2.26 GB         20.00%  0
 10.0.0.56       datacenter1 rack1       Up     Normal  2.47 GB         20.00%  34028236692093846346337460743176821145
 10.0.0.55       datacenter1 rack1       Up     Normal  2.52 GB         20.00%  68056473384187692692674921486353642290
 10.0.0.54       datacenter1 rack1       Up     Normal  950.97 MB       20.00%  102084710076281539039012382229530463435
 10.0.0.72       datacenter1 rack1       Up     Normal  383.25 MB       20.00%  136112946768375385385349842972707284580

 The nodes with ReplicateOnWrites are the 3 in the middle.  The first node
 and last node both have a count of 0.  This is a clean cluster, and I've
 been doing 3k ... 2.5k (decaying performance) inserts/sec for the last 12
 hours.  The last time this test ran, it went all the way down to 500
 inserts/sec before I killed it.


Re: Replicate On Write behavior

2011-09-01 Thread Konstantin Naryshkin
Yeah, I believe that Yang has a typo in his post.  A CF is not read in one go; 
a row is.  As for the scalability of having all the columns read at once, I do 
not believe it was ever meant to scale.  All the columns in a row are stored 
together, on the same set of machines.  This means that if you have very large 
rows you can end up with an unbalanced cluster, but it also allows reads of 
several columns out of a row to be more efficient, since they are all together 
on the same machine (no need to gather results from several machines) and 
should read quickly since they are all together on disk.

- Original Message -
From: Ian Danforth idanfo...@numenta.com
To: user@cassandra.apache.org
Sent: Thursday, September 1, 2011 4:35:33 PM
Subject: Re: Replicate On Write behavior

I'm not sure I understand the scalability of this approach. A given
column family can be HUGE with millions of rows and columns. In my
cluster I have a single column family that accounts for 90GB of load
on each node. Not only that, but that column family is distributed over the
entire ring.

Clearly I'm misunderstanding something.

Ian

On Thu, Sep 1, 2011 at 1:17 PM, Yang tedd...@gmail.com wrote:
 When Cassandra reads, the entire CF is always read together; only at the
 hand-over to the client does the pruning happen.

 On Thu, Sep 1, 2011 at 11:52 AM, David Hawthorne dha...@gmx.3crowd.com
 wrote:

 I'm curious... digging through the source, it looks like replicate on
 write triggers a read of the entire row, and not just the
 columns/supercolumns that are affected by the counter update.  Is this the
 case?  It would certainly explain why my inserts/sec decay over time and why
 the average insert latency increases over time.  The strange thing is that
 I'm not seeing disk read IO increase over that same period, but that might
 be due to the OS buffer cache...

 On another note, on a 5-node cluster, I'm only seeing 3 nodes with
 ReplicateOnWrite Completed tasks in nodetool tpstats output.  Is that
 normal?  I'm using RandomPartitioner...

 Address         DC          Rack        Status State   Load
  Owns    Token

  136112946768375385385349842972707284580
 10.0.0.57    datacenter1 rack1       Up     Normal  2.26 GB         20.00%
  0
 10.0.0.56    datacenter1 rack1       Up     Normal  2.47 GB         20.00%
  34028236692093846346337460743176821145
 10.0.0.55    datacenter1 rack1       Up     Normal  2.52 GB         20.00%
  68056473384187692692674921486353642290
 10.0.0.54    datacenter1 rack1       Up     Normal  950.97 MB       20.00%
  102084710076281539039012382229530463435
 10.0.0.72    datacenter1 rack1       Up     Normal  383.25 MB       20.00%
  136112946768375385385349842972707284580

 The nodes with ReplicateOnWrites are the 3 in the middle.  The first node
 and last node both have a count of 0.  This is a clean cluster, and I've
 been doing 3k ... 2.5k (decaying performance) inserts/sec for the last 12
 hours.  The last time this test ran, it went all the way down to 500
 inserts/sec before I killed it.



Re: Replicate On Write behavior

2011-09-01 Thread Yang
Sorry, I meant CF * row.

If you look in the code, db.cf is basically just a set of columns.
On Sep 1, 2011 1:36 PM, Ian Danforth idanfo...@numenta.com wrote:
 I'm not sure I understand the scalability of this approach. A given
 column family can be HUGE with millions of rows and columns. In my
 cluster I have a single column family that accounts for 90GB of load
 on each node. Not only that, but that column family is distributed over the
 entire ring.

 Clearly I'm misunderstanding something.

 Ian

 On Thu, Sep 1, 2011 at 1:17 PM, Yang tedd...@gmail.com wrote:
 When Cassandra reads, the entire CF is always read together; only at the
 hand-over to the client does the pruning happen.

 On Thu, Sep 1, 2011 at 11:52 AM, David Hawthorne dha...@gmx.3crowd.com
 wrote:

 I'm curious... digging through the source, it looks like replicate on
 write triggers a read of the entire row, and not just the
 columns/supercolumns that are affected by the counter update.  Is this the
 case?  It would certainly explain why my inserts/sec decay over time and
 why the average insert latency increases over time.  The strange thing is
 that I'm not seeing disk read IO increase over that same period, but that
 might be due to the OS buffer cache...

 On another note, on a 5-node cluster, I'm only seeing 3 nodes with
 ReplicateOnWrite Completed tasks in nodetool tpstats output.  Is that
 normal?  I'm using RandomPartitioner...

 Address         DC          Rack        Status State   Load            Owns    Token
                                                                                 136112946768375385385349842972707284580
 10.0.0.57       datacenter1 rack1       Up     Normal  2.26 GB         20.00%  0
 10.0.0.56       datacenter1 rack1       Up     Normal  2.47 GB         20.00%  34028236692093846346337460743176821145
 10.0.0.55       datacenter1 rack1       Up     Normal  2.52 GB         20.00%  68056473384187692692674921486353642290
 10.0.0.54       datacenter1 rack1       Up     Normal  950.97 MB       20.00%  102084710076281539039012382229530463435
 10.0.0.72       datacenter1 rack1       Up     Normal  383.25 MB       20.00%  136112946768375385385349842972707284580

 The nodes with ReplicateOnWrites are the 3 in the middle.  The first node
 and last node both have a count of 0.  This is a clean cluster, and I've
 been doing 3k ... 2.5k (decaying performance) inserts/sec for the last 12
 hours.  The last time this test ran, it went all the way down to 500
 inserts/sec before I killed it.