Re: Replicate On Write behavior

2011-09-09 Thread David Hawthorne
They are evenly distributed.  5 nodes * 40 connections each using hector, and I 
can confirm that all 200 are active when this happened (from hector's 
perspective, from graphing the hector jmx data), and all 5 nodes saw roughly 40 
connections, and all were receiving traffic over those connections.  (netstat + 
ntop + trafshow, etc)  I can also confirm that I changed my insert strategy to 
break up the rows using composite row keys, which reduced the row lengths and 
gave me an almost perfectly even data distribution among the nodes, and that 
was when I started to really dig into why the ReplicateOnWrite (ROW) tasks were 
still backing up on one node specifically, and why 2 nodes weren't seeing any.  
It was the 20%, 20%, 
60% ROW distribution that really got me thinking, and when I took the 60% node 
out of the cluster, that ROW load jumped back to the node with the next-lowest 
IP address, and the 2 nodes that weren't seeing any *still* weren't seeing any 
ROWs.  At that point I tore down the cluster, recreated it as a 3 node cluster 
several times using various permutations of the 5 nodes available, and ROW load 
was *always* on the node with the lowest IP address.

The theory might not be right, but it certainly represents the behavior I saw.


On Sep 9, 2011, at 12:17 AM, Sylvain Lebresne wrote:

 We'll solve #2890 and we should have done it sooner.
 
 That being said, a quick question: how do you do your inserts from the
 clients? Are you evenly distributing the inserts among the nodes, or are
 you always hitting the same coordinator?
 
 Because provided the nodes are correctly distributed on the ring, if you
 distribute the insert (increment) requests across the nodes (again, I'm
 talking about client requests), you should not see the behavior you observe.
 
 --
 Sylvain
 
 On Thu, Sep 8, 2011 at 9:48 PM, David Hawthorne dha...@gmx.3crowd.com wrote:
 It was exactly due to 2890, and the fact that the first replica is always 
 the one with the lowest value IP address.  I patched cassandra to pick a 
 random node out of the replica set in StorageProxy.java findSuitableEndpoint:
 
 Random rng = new Random();
 
 return endpoints.get(rng.nextInt(endpoints.size()));  // instead of: return endpoints.get(0);
 
 Now the workload is evenly balanced among all 5 nodes and I'm getting 2.5x the 
 inserts/sec throughput.
 
 Here's the behavior I saw ("disk work" refers to the ReplicateOnWrite
 load of a counter insert):
 
 One node will get RF/n of the disk work.  Two nodes will always get 0 disk 
 work.
 
 In a 3-node cluster, 1 node gets hit really hard on disk, and you get the 
 performance of a one-node cluster.
 In a 6-node cluster, 1 node gets hit with 50% of the disk work, giving you 
 the performance of a ~2-node cluster.
 In a 10-node cluster, 1 node gets 30% of the disk work, giving you the 
 performance of a ~3-node cluster.
 
 I confirmed this behavior with a 3, 4, and 5 node cluster size.
 
 
 
 On another note, on a 5-node cluster, I'm only seeing 3 nodes with 
 ReplicateOnWrite Completed tasks in nodetool tpstats output.  Is that 
 normal?  I'm using RandomPartitioner...
 
 Address         DC          Rack        Status State   Load            Owns    Token
                                                                                 136112946768375385385349842972707284580
 10.0.0.57       datacenter1 rack1       Up     Normal  2.26 GB         20.00%  0
 10.0.0.56       datacenter1 rack1       Up     Normal  2.47 GB         20.00%  34028236692093846346337460743176821145
 10.0.0.55       datacenter1 rack1       Up     Normal  2.52 GB         20.00%  68056473384187692692674921486353642290
 10.0.0.54       datacenter1 rack1       Up     Normal  950.97 MB       20.00%  102084710076281539039012382229530463435
 10.0.0.72       datacenter1 rack1       Up     Normal  383.25 MB       20.00%  136112946768375385385349842972707284580
 
 The nodes with ReplicateOnWrites are the 3 in the middle.  The first node 
 and last node both have a count of 0.  This is a clean cluster, and I've 
 been doing 3k ... 2.5k (decaying performance) inserts/sec for the last 12 
 hours.  The last time this test ran, it went all the way down to 500 
 inserts/sec before I killed it.
 
 Could be due to https://issues.apache.org/jira//browse/CASSANDRA-2890.
 
 --
 Sylvain
 
 



Re: Replicate On Write behavior

2011-09-08 Thread David Hawthorne
It was exactly due to 2890, and the fact that the first replica is always the 
one with the lowest value IP address.  I patched cassandra to pick a random 
node out of the replica set in StorageProxy.java findSuitableEndpoint:

Random rng = new Random();

return endpoints.get(rng.nextInt(endpoints.size()));  // instead of: return endpoints.get(0);

Now the workload is evenly balanced among all 5 nodes and I'm getting 2.5x the 
inserts/sec throughput.
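
For anyone who wants to try this without digging through StorageProxy, here's a 
stripped-down, self-contained sketch of the idea (the class and method names 
below are stand-ins for illustration, not the actual Cassandra code; the real 
change is just the one-liner above in findSuitableEndpoint):

import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Schematic only: pick the ReplicateOnWrite target from the whole replica set
// instead of always taking the first (lowest-IP) replica.
public class ReplicaPickerSketch
{
    private static final Random RNG = new Random();

    // Stock behavior (the CASSANDRA-2890 hot spot): always the first replica.
    static String firstReplica(List<String> endpoints)
    {
        return endpoints.get(0);
    }

    // Patched behavior: spread the load across all replicas of the key.
    static String randomReplica(List<String> endpoints)
    {
        return endpoints.get(RNG.nextInt(endpoints.size()));
    }

    public static void main(String[] args)
    {
        List<String> replicas = Arrays.asList("10.0.0.55", "10.0.0.56", "10.0.0.57");
        System.out.println("stock:   " + firstReplica(replicas));
        System.out.println("patched: " + randomReplica(replicas));
    }
}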

Here's the behavior I saw ("disk work" refers to the ReplicateOnWrite load 
of a counter insert):

One node will get RF/n of the disk work.  Two nodes will always get 0 disk work.

In a 3-node cluster, 1 node gets hit really hard on disk, and you get the 
performance of a one-node cluster.
In a 6-node cluster, 1 node gets hit with 50% of the disk work, giving you the 
performance of a ~2-node cluster.
In a 10-node cluster, 1 node gets 30% of the disk work, giving you the 
performance of a ~3-node cluster.

I confirmed this behavior with a 3, 4, and 5 node cluster size.
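
To put rough numbers on it (this is just my back-of-the-envelope model, taking 
the RF/n share above at face value): if a single node absorbs that fraction of 
the disk work, it becomes the bottleneck and the cluster behaves like roughly 
n/RF nodes.

// Back-of-the-envelope model for the effective cluster size when one node
// takes an RF/n share of the ReplicateOnWrite disk work (RF=3 here).
public class RowBottleneckModel
{
    public static void main(String[] args)
    {
        int rf = 3;
        for (int n : new int[] { 3, 6, 10 })
        {
            double hotNodeShare = (double) rf / n;       // fraction of disk work on the hot node
            double effectiveNodes = 1.0 / hotNodeShare;  // roughly n / RF
            System.out.printf("n=%d: hot node does %.0f%%, cluster behaves like ~%.1f nodes%n",
                              n, hotNodeShare * 100, effectiveNodes);
        }
    }
}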


 
 On another note, on a 5-node cluster, I'm only seeing 3 nodes with 
 ReplicateOnWrite Completed tasks in nodetool tpstats output.  Is that 
 normal?  I'm using RandomPartitioner...
 
 Address         DC          Rack        Status State   Load            Owns    Token
                                                                                 136112946768375385385349842972707284580
 10.0.0.57       datacenter1 rack1       Up     Normal  2.26 GB         20.00%  0
 10.0.0.56       datacenter1 rack1       Up     Normal  2.47 GB         20.00%  34028236692093846346337460743176821145
 10.0.0.55       datacenter1 rack1       Up     Normal  2.52 GB         20.00%  68056473384187692692674921486353642290
 10.0.0.54       datacenter1 rack1       Up     Normal  950.97 MB       20.00%  102084710076281539039012382229530463435
 10.0.0.72       datacenter1 rack1       Up     Normal  383.25 MB       20.00%  136112946768375385385349842972707284580
 
 The nodes with ReplicateOnWrites are the 3 in the middle.  The first node 
 and last node both have a count of 0.  This is a clean cluster, and I've 
 been doing 3k ... 2.5k (decaying performance) inserts/sec for the last 12 
 hours.  The last time this test ran, it went all the way down to 500 
 inserts/sec before I killed it.
 
 Could be due to https://issues.apache.org/jira//browse/CASSANDRA-2890.
 
 --
 Sylvain



Re: Replicate On Write behavior

2011-09-02 Thread Sylvain Lebresne
On Thu, Sep 1, 2011 at 8:52 PM, David Hawthorne dha...@gmx.3crowd.com wrote:
 I'm curious... digging through the source, it looks like replicate on write 
 triggers a read of the entire row, and not just the columns/supercolumns that 
 are affected by the counter update.  Is this the case?  It would certainly 
 explain why my inserts/sec decay over time and why the average insert latency 
 increases over time.  The strange thing is that I'm not seeing disk read IO 
 increase over that same period, but that might be due to the OS buffer 
 cache...

It does not. It only reads the columns/supercolumns affected by the
counter update.
In the source, this happens in CounterMutation.java. If you look at
addReadCommandFromColumnFamily you'll see that it does a query by name
only for the column involved in the update (the update is basically
the content of the columnFamily parameter there).

And Cassandra does *not* always read a full row. Never has, never will.
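
To make the difference concrete, a purely schematic sketch (the class and method 
names below are hypothetical, not the actual code; in the real code the by-name 
read command is built in CounterMutation, as described above):

import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only -- these are not Cassandra classes.  The point
// is just that the ReplicateOnWrite read names exactly the columns touched by
// the counter update; it does not slice the whole row.
class ByNameReadSketch
{
    static Set<String> columnsToRead(Set<String> columnsInUpdate)
    {
        // Query by name: one entry per updated column, nothing else from the row.
        return new TreeSet<>(columnsInUpdate);
    }

    public static void main(String[] args)
    {
        Set<String> update = new TreeSet<>(Arrays.asList("clicks", "pageviews"));
        System.out.println("by-name read for: " + columnsToRead(update));
    }
}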

 On another note, on a 5-node cluster, I'm only seeing 3 nodes with 
 ReplicateOnWrite Completed tasks in nodetool tpstats output.  Is that normal? 
  I'm using RandomPartitioner...

 Address         DC          Rack        Status State   Load            Owns   
  Token
                                                                            
 136112946768375385385349842972707284580
 10.0.0.57    datacenter1 rack1       Up     Normal  2.26 GB         20.00%  0
 10.0.0.56    datacenter1 rack1       Up     Normal  2.47 GB         20.00%  
 34028236692093846346337460743176821145
 10.0.0.55    datacenter1 rack1       Up     Normal  2.52 GB         20.00%  
 68056473384187692692674921486353642290
 10.0.0.54    datacenter1 rack1       Up     Normal  950.97 MB       20.00%  
 102084710076281539039012382229530463435
 10.0.0.72    datacenter1 rack1       Up     Normal  383.25 MB       20.00%  
 136112946768375385385349842972707284580

 The nodes with ReplicateOnWrites are the 3 in the middle.  The first node and 
 last node both have a count of 0.  This is a clean cluster, and I've been 
 doing 3k ... 2.5k (decaying performance) inserts/sec for the last 12 hours.  
 The last time this test ran, it went all the way down to 500 inserts/sec 
 before I killed it.

Could be due to https://issues.apache.org/jira//browse/CASSANDRA-2890.

--
Sylvain


Re: Replicate On Write behavior

2011-09-02 Thread David Hawthorne
That's interesting.  I did an experiment wherein I added some entropy to the 
row name based on the time when the increment came in (e.g. row = row + "/" + 
(timestamp - (timestamp % 300))), and now not only is the load (in GB) on my 
cluster more balanced, but the performance has also not decayed and has stayed 
steady (inserts/sec) with a relatively low average ms/insert.  Each row is now 
significantly shorter as a result of this change.
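
In case it helps anyone else, here's a minimal, self-contained sketch of the key 
bucketing described above (the class, method, and example key names are made up 
for illustration; the 300-second bucket width matches what I used):

// Minimal sketch: append the start of the current 5-minute window to the row
// key so counters for one logical row spread across many shorter physical rows.
public class BucketedRowKey
{
    static String bucketedKey(String baseRow, long epochSeconds, long bucketSeconds)
    {
        long bucketStart = epochSeconds - (epochSeconds % bucketSeconds);
        return baseRow + "/" + bucketStart;
    }

    public static void main(String[] args)
    {
        long now = System.currentTimeMillis() / 1000L;
        // e.g. "site42/hits/1315526400" -- each 300s window gets its own row.
        System.out.println(bucketedKey("site42/hits", now, 300));
    }
}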



On Sep 2, 2011, at 12:30 AM, Sylvain Lebresne wrote:

 On Thu, Sep 1, 2011 at 8:52 PM, David Hawthorne dha...@gmx.3crowd.com wrote:
 I'm curious... digging through the source, it looks like replicate on write 
 triggers a read of the entire row, and not just the columns/supercolumns 
 that are affected by the counter update.  Is this the case?  It would 
 certainly explain why my inserts/sec decay over time and why the average 
 insert latency increases over time.  The strange thing is that I'm not 
 seeing disk read IO increase over that same period, but that might be due to 
 the OS buffer cache...
 
 It does not. It only reads the columns/supercolumns affected by the
 counter update.
 In the source, this happens in CounterMutation.java. If you look at
 addReadCommandFromColumnFamily you'll see that it does a query by name
 only for the column involved in the update (the update is basically
 the content of the columnFamily parameter there).
 
 And Cassandra does *not* always read a full row. Never has, never will.
 
 On another note, on a 5-node cluster, I'm only seeing 3 nodes with 
 ReplicateOnWrite Completed tasks in nodetool tpstats output.  Is that 
 normal?  I'm using RandomPartitioner...
 
 Address         DC          Rack        Status State   Load            Owns    Token
                                                                                 136112946768375385385349842972707284580
 10.0.0.57       datacenter1 rack1       Up     Normal  2.26 GB         20.00%  0
 10.0.0.56       datacenter1 rack1       Up     Normal  2.47 GB         20.00%  34028236692093846346337460743176821145
 10.0.0.55       datacenter1 rack1       Up     Normal  2.52 GB         20.00%  68056473384187692692674921486353642290
 10.0.0.54       datacenter1 rack1       Up     Normal  950.97 MB       20.00%  102084710076281539039012382229530463435
 10.0.0.72       datacenter1 rack1       Up     Normal  383.25 MB       20.00%  136112946768375385385349842972707284580
 
 The nodes with ReplicateOnWrites are the 3 in the middle.  The first node 
 and last node both have a count of 0.  This is a clean cluster, and I've 
 been doing 3k ... 2.5k (decaying performance) inserts/sec for the last 12 
 hours.  The last time this test ran, it went all the way down to 500 
 inserts/sec before I killed it.
 
 Could be due to https://issues.apache.org/jira//browse/CASSANDRA-2890.
 
 --
 Sylvain



Re: Replicate On Write behavior

2011-09-02 Thread Ian Danforth
That ticket explains a lot, looking forward to a resolution on it.
(Sorry I don't have a patch to offer)

Ian

On Fri, Sep 2, 2011 at 12:30 AM, Sylvain Lebresne sylv...@datastax.com wrote:
 On Thu, Sep 1, 2011 at 8:52 PM, David Hawthorne dha...@gmx.3crowd.com wrote:
 I'm curious... digging through the source, it looks like replicate on write 
 triggers a read of the entire row, and not just the columns/supercolumns 
 that are affected by the counter update.  Is this the case?  It would 
 certainly explain why my inserts/sec decay over time and why the average 
 insert latency increases over time.  The strange thing is that I'm not 
 seeing disk read IO increase over that same period, but that might be due to 
 the OS buffer cache...

 It does not. It only reads the columns/supercolumns affected by the
 counter update.
 In the source, this happens in CounterMutation.java. If you look at
 addReadCommandFromColumnFamily you'll see that it does a query by name
 only for the column involved in the update (the update is basically
 the content of the columnFamily parameter there).

 And Cassandra does *not* always read a full row. Never has, never will.

 On another note, on a 5-node cluster, I'm only seeing 3 nodes with 
 ReplicateOnWrite Completed tasks in nodetool tpstats output.  Is that 
 normal?  I'm using RandomPartitioner...

 Address         DC          Rack        Status State   Load            Owns  
   Token
                                                                            
 136112946768375385385349842972707284580
 10.0.0.57    datacenter1 rack1       Up     Normal  2.26 GB         20.00%  0
 10.0.0.56    datacenter1 rack1       Up     Normal  2.47 GB         20.00%  
 34028236692093846346337460743176821145
 10.0.0.55    datacenter1 rack1       Up     Normal  2.52 GB         20.00%  
 68056473384187692692674921486353642290
 10.0.0.54    datacenter1 rack1       Up     Normal  950.97 MB       20.00%  
 102084710076281539039012382229530463435
 10.0.0.72    datacenter1 rack1       Up     Normal  383.25 MB       20.00%  
 136112946768375385385349842972707284580

 The nodes with ReplicateOnWrites are the 3 in the middle.  The first node 
 and last node both have a count of 0.  This is a clean cluster, and I've 
 been doing 3k ... 2.5k (decaying performance) inserts/sec for the last 12 
 hours.  The last time this test ran, it went all the way down to 500 
 inserts/sec before I killed it.

 Could be due to https://issues.apache.org/jira//browse/CASSANDRA-2890.

 --
 Sylvain



Re: Replicate On Write behavior

2011-09-02 Thread David Hawthorne
Does it always pick the node with the lowest IP address?  All of my hosts are 
in the same /24.  The fourth node in the 5 node cluster has the lowest value in 
the 4th octet (54).  I erased the cluster and rebuilt it from scratch as a 3 
node cluster using the first 3 nodes, and now the ReplicateOnWrites are all 
going to the third node, which is also the lowest valued IP address (55).

That would explain why only 1 node gets writes in a 3 node cluster (RF=3) and 
why 3 nodes get writes in a 5 node cluster, and why one of those 3 is taking 
66% of the writes.
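
For what it's worth, here's a toy demonstration of why a lowest-IP bias would 
fall straight out of any ordering that reduces to comparing raw address bytes 
and then taking element 0 (this is just my reading of the behavior, not the 
actual snitch/StorageProxy code):

import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Toy illustration: order the replica set by raw address bytes and always take
// element 0 -- the lowest IP in the set wins every time.
public class LowestIpWins
{
    public static void main(String[] args) throws UnknownHostException
    {
        List<InetAddress> replicas = Arrays.asList(
                InetAddress.getByName("10.0.0.57"),
                InetAddress.getByName("10.0.0.56"),
                InetAddress.getByName("10.0.0.55"));

        replicas.sort(Comparator.comparing(a -> new java.math.BigInteger(1, a.getAddress())));

        // Prints 10.0.0.55 for this set, matching the ROW hot spot I saw.
        System.out.println("first replica: " + replicas.get(0).getHostAddress());
    }
}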


 
 On another note, on a 5-node cluster, I'm only seeing 3 nodes with 
 ReplicateOnWrite Completed tasks in nodetool tpstats output.  Is that 
 normal?  I'm using RandomPartitioner...
 
 Address         DC          Rack        Status State   Load            Owns    Token
                                                                                 136112946768375385385349842972707284580
 10.0.0.57       datacenter1 rack1       Up     Normal  2.26 GB         20.00%  0
 10.0.0.56       datacenter1 rack1       Up     Normal  2.47 GB         20.00%  34028236692093846346337460743176821145
 10.0.0.55       datacenter1 rack1       Up     Normal  2.52 GB         20.00%  68056473384187692692674921486353642290
 10.0.0.54       datacenter1 rack1       Up     Normal  950.97 MB       20.00%  102084710076281539039012382229530463435
 10.0.0.72       datacenter1 rack1       Up     Normal  383.25 MB       20.00%  136112946768375385385349842972707284580
 
 The nodes with ReplicateOnWrites are the 3 in the middle.  The first node 
 and last node both have a count of 0.  This is a clean cluster, and I've 
 been doing 3k ... 2.5k (decaying performance) inserts/sec for the last 12 
 hours.  The last time this test ran, it went all the way down to 500 
 inserts/sec before I killed it.
 
 Could be due to https://issues.apache.org/jira//browse/CASSANDRA-2890.
 
 --
 Sylvain



Re: Replicate On Write behavior

2011-09-01 Thread Yang
When Cassandra reads, the entire CF is always read together; only at the
hand-over to the client does the pruning happen.

On Thu, Sep 1, 2011 at 11:52 AM, David Hawthorne dha...@gmx.3crowd.com wrote:

 I'm curious... digging through the source, it looks like replicate on write
 triggers a read of the entire row, and not just the columns/supercolumns
 that are affected by the counter update.  Is this the case?  It would
 certainly explain why my inserts/sec decay over time and why the average
 insert latency increases over time.  The strange thing is that I'm not
 seeing disk read IO increase over that same period, but that might be due to
 the OS buffer cache...

 On another note, on a 5-node cluster, I'm only seeing 3 nodes with
 ReplicateOnWrite Completed tasks in nodetool tpstats output.  Is that
 normal?  I'm using RandomPartitioner...

 Address         DC          Rack        Status State   Load            Owns    Token
                                                                                 136112946768375385385349842972707284580
 10.0.0.57       datacenter1 rack1       Up     Normal  2.26 GB         20.00%  0
 10.0.0.56       datacenter1 rack1       Up     Normal  2.47 GB         20.00%  34028236692093846346337460743176821145
 10.0.0.55       datacenter1 rack1       Up     Normal  2.52 GB         20.00%  68056473384187692692674921486353642290
 10.0.0.54       datacenter1 rack1       Up     Normal  950.97 MB       20.00%  102084710076281539039012382229530463435
 10.0.0.72       datacenter1 rack1       Up     Normal  383.25 MB       20.00%  136112946768375385385349842972707284580

 The nodes with ReplicateOnWrites are the 3 in the middle.  The first node
 and last node both have a count of 0.  This is a clean cluster, and I've
 been doing 3k ... 2.5k (decaying performance) inserts/sec for the last 12
 hours.  The last time this test ran, it went all the way down to 500
 inserts/sec before I killed it.


Re: Replicate On Write behavior

2011-09-01 Thread Konstantin Naryshkin
Yeah, I believe that Yang has a typo in his post.  A CF is not read in one go; 
a row is.  As for the scalability of having all the columns read at once, I do 
not believe it was ever meant to scale.  All the columns in a row are stored 
together, on the same set of machines.  This means that if you have very large 
rows you can end up with an unbalanced cluster, but it also allows reads of 
several columns out of a row to be more efficient, since they are all together 
on the same machine (no need to gather results from several machines) and 
should read quickly since they are all together on disk.

- Original Message -
From: Ian Danforth idanfo...@numenta.com
To: user@cassandra.apache.org
Sent: Thursday, September 1, 2011 4:35:33 PM
Subject: Re: Replicate On Write behavior

I'm not sure I understand the scalability of this approach. A given
column family can be HUGE with millions of rows and columns. In my
cluster I have a single column family that accounts for 90GB of load
on each node. Not only that, but that column family is distributed over the
entire ring.

Clearly I'm misunderstanding something.

Ian

On Thu, Sep 1, 2011 at 1:17 PM, Yang tedd...@gmail.com wrote:
 When Cassandra reads, the entire CF is always read together; only at the
 hand-over to the client does the pruning happen.

 On Thu, Sep 1, 2011 at 11:52 AM, David Hawthorne dha...@gmx.3crowd.com
 wrote:

 I'm curious... digging through the source, it looks like replicate on
 write triggers a read of the entire row, and not just the
 columns/supercolumns that are affected by the counter update.  Is this the
 case?  It would certainly explain why my inserts/sec decay over time and why
 the average insert latency increases over time.  The strange thing is that
 I'm not seeing disk read IO increase over that same period, but that might
 be due to the OS buffer cache...

 On another note, on a 5-node cluster, I'm only seeing 3 nodes with
 ReplicateOnWrite Completed tasks in nodetool tpstats output.  Is that
 normal?  I'm using RandomPartitioner...

 Address         DC          Rack        Status State   Load
  Owns    Token

  136112946768375385385349842972707284580
 10.0.0.57    datacenter1 rack1       Up     Normal  2.26 GB         20.00%
  0
 10.0.0.56    datacenter1 rack1       Up     Normal  2.47 GB         20.00%
  34028236692093846346337460743176821145
 10.0.0.55    datacenter1 rack1       Up     Normal  2.52 GB         20.00%
  68056473384187692692674921486353642290
 10.0.0.54    datacenter1 rack1       Up     Normal  950.97 MB       20.00%
  102084710076281539039012382229530463435
 10.0.0.72    datacenter1 rack1       Up     Normal  383.25 MB       20.00%
  136112946768375385385349842972707284580

 The nodes with ReplicateOnWrites are the 3 in the middle.  The first node
 and last node both have a count of 0.  This is a clean cluster, and I've
 been doing 3k ... 2.5k (decaying performance) inserts/sec for the last 12
 hours.  The last time this test ran, it went all the way down to 500
 inserts/sec before I killed it.



Re: Replicate On Write behavior

2011-09-01 Thread Yang
Sorry, I meant CF * row.

If you look in the code, db.cf is basically just a set of columns.
On Sep 1, 2011 1:36 PM, Ian Danforth idanfo...@numenta.com wrote:
 I'm not sure I understand the scalability of this approach. A given
 column family can be HUGE with millions of rows and columns. In my
 cluster I have a single column family that accounts for 90GB of load
 on each node. Not only that, but that column family is distributed over the
 entire ring.

 Clearly I'm misunderstanding something.

 Ian

 On Thu, Sep 1, 2011 at 1:17 PM, Yang tedd...@gmail.com wrote:
 When Cassandra reads, the entire CF is always read together; only at the
 hand-over to the client does the pruning happen.

 On Thu, Sep 1, 2011 at 11:52 AM, David Hawthorne dha...@gmx.3crowd.com
 wrote:

 I'm curious... digging through the source, it looks like replicate on
 write triggers a read of the entire row, and not just the
 columns/supercolumns that are affected by the counter update.  Is this the
 case?  It would certainly explain why my inserts/sec decay over time and
 why the average insert latency increases over time.  The strange thing is
 that I'm not seeing disk read IO increase over that same period, but that
 might be due to the OS buffer cache...

 On another note, on a 5-node cluster, I'm only seeing 3 nodes with
 ReplicateOnWrite Completed tasks in nodetool tpstats output.  Is that
 normal?  I'm using RandomPartitioner...

 Address         DC          Rack        Status State   Load            Owns    Token
                                                                                 136112946768375385385349842972707284580
 10.0.0.57       datacenter1 rack1       Up     Normal  2.26 GB         20.00%  0
 10.0.0.56       datacenter1 rack1       Up     Normal  2.47 GB         20.00%  34028236692093846346337460743176821145
 10.0.0.55       datacenter1 rack1       Up     Normal  2.52 GB         20.00%  68056473384187692692674921486353642290
 10.0.0.54       datacenter1 rack1       Up     Normal  950.97 MB       20.00%  102084710076281539039012382229530463435
 10.0.0.72       datacenter1 rack1       Up     Normal  383.25 MB       20.00%  136112946768375385385349842972707284580

 The nodes with ReplicateOnWrites are the 3 in the middle.  The first node
 and last node both have a count of 0.  This is a clean cluster, and I've
 been doing 3k ... 2.5k (decaying performance) inserts/sec for the last 12
 hours.  The last time this test ran, it went all the way down to 500
 inserts/sec before I killed it.