Just for reference, HBase's counters also do a local read. I am not saying they work better/worse/faster/slower, but I would not expect any system that reads on increment to be significantly faster than what Cassandra does.
I'm just saying your counter throughput is read-bound; this is not unique to C*'s implementation.

On Wed, Nov 28, 2012 at 2:41 PM, Sergey Olefir <solf.li...@gmail.com> wrote:

> Well, that is sad news then. I don't think I can consider 20k increments
> per second for a two-node cluster (with RF=2) reasonable performance
> (cost vs. benefit).
>
> I might have to look into other storage solutions or perhaps experiment
> with duplicate clusters with RF=1 or replicate_on_write=false.
>
> Although yes, I probably should try that row cache you mentioned -- I saw
> that the key cache was going unused (so I saw no reason to try enabling
> the row cache), but I think that was on RF=1; it might be different on
> RF=2.
>
>
> Sylvain Lebresne-3 wrote
> > Counter replication works differently from that of "normal" writes.
> > Namely, a counter update is written to a first replica, then a read is
> > performed and the result of that read is replicated to the other nodes.
> > With RF=1, since there is only one replica, no read is involved, but in
> > a way it's a degenerate case. So there are two reasons why RF>1 is much
> > slower than RF=1:
> > 1) it involves a read to replicate, and that read takes time.
> > Especially if that read hits the disk, it may even dominate the
> > insertion time.
> > 2) the replication to the first replica and the one to the rest of the
> > replicas are not done in parallel but sequentially. Note that this is
> > only true for the first replica versus the others. In other words, from
> > RF=2 to RF=3 you should not see a significant performance degradation.
> >
> > Note that while there is nothing you can do for 2), you can try to
> > speed up 1) by using the row cache, for instance (in case you weren't).
> >
> > In other words, with counters, it is expected that RF=1 be potentially
> > much faster than RF>1. That is the way counters work.
> >
> > And don't get me wrong, I'm not suggesting you should use RF=1 at all.
> > What I am saying is that the performance you see with RF=2 is the
> > performance of counters in Cassandra.
> >
> > --
> > Sylvain
> >
> >
> > On Wed, Nov 28, 2012 at 7:34 AM, Sergey Olefir <solf.lists@> wrote:
> >
> >> I think there might be a misunderstanding as to the nature of the
> >> problem.
> >>
> >> Say, I have test set T. And I have two identical servers A and B.
> >> - I tested that server A (singly) is able to handle load of T.
> >> - I tested that server B (singly) is able to handle load of T.
> >> - I then join A and B in the cluster and set replication=2 -- this
> >> means that each server in effect has to handle the full test load
> >> individually (because there are two servers and replication=2, each
> >> server effectively has to handle all the data written to the
> >> cluster). Under these circumstances it is reasonable to assume that
> >> cluster A+B shall be able to handle load T because each server is
> >> able to do so individually.
> >>
> >> HOWEVER, this is not the case. In fact, A+B together are only able to
> >> handle less than 1/3 of T DESPITE the fact that A and B individually
> >> are able to handle T just fine.
> >>
> >> I think there's something wrong with Cassandra replication (possibly
> >> as simple as me misconfiguring something) -- it shouldn't be three
> >> times faster to write to two separate nodes in parallel as compared
> >> to writing to a 2-node Cassandra cluster with replication=2.
> >>
> >>
> >> Edward Capriolo wrote
> >> > Say you are doing 100 inserts at rf 1 on two nodes. That is 50
> >> > inserts a node. If you go to rf 2, that is 100 inserts a node. If
> >> > you were at 75% capacity on each node, you're now at 150%, which is
> >> > not possible, so things bog down.
> >> >
> >> > To figure out what is going on we would need to see tpstats,
> >> > iostat, and top information.
> >> >
> >> > I think you're looking at the performance the wrong way.
> >> > Starting off at rf 1 is not the way to understand Cassandra
> >> > performance.
> >> >
> >> > You do not get the benefits of "scale out" until you fix your rf
> >> > and increase your node count. I.e., 5 nodes at rf 3 is fast; 10
> >> > nodes at rf 3 is even better.
> >> >
> >> > On Tuesday, November 27, 2012, Sergey Olefir <solf.lists@> wrote:
> >> >> I already do a lot of in-memory aggregation before writing to
> >> >> Cassandra.
> >> >>
> >> >> The question here is what is wrong with Cassandra (or its
> >> >> configuration) that causes a huge performance drop when moving
> >> >> from 1-replication to 2-replication for counters -- and more
> >> >> importantly, how to resolve the problem. A 2x-3x drop when moving
> >> >> from 1-replication to 2-replication on two nodes is reasonable.
> >> >> 6x is not. Like I said, with this kind of performance degradation
> >> >> it makes more sense to run two clusters with replication=1 in
> >> >> parallel rather than rely on Cassandra replication.
> >> >>
> >> >> And yes, Rainbird was the inspiration for what we are trying to do
> >> >> here :)
> >> >>
> >> >>
> >> >>
> >> >> Edward Capriolo wrote
> >> >>> Cassandra's counters read on increment. Additionally, they are
> >> >>> distributed, so there can be multiple reads on increment. If they
> >> >>> are not fast enough and you have exhausted all tuning options,
> >> >>> add more servers to handle the load.
> >> >>>
> >> >>> In many cases incrementing the same counter n times can be
> >> >>> avoided.
> >> >>>
> >> >>> Twitter's Rainbird did just that. It avoided multiple counter
> >> >>> increments by batching them.
> >> >>>
> >> >>> I have done a similar thing using Cassandra and Kafka.
> >> >>>
> >> >>> https://github.com/edwardcapriolo/IronCount/blob/master/src/test/java/com/jointhegrid/ironcount/mockingbird/MockingBirdMessageHandler.java
> >> >>>
> >> >>>
> >> >>> On Tuesday, November 27, 2012, Sergey Olefir <solf.lists@> wrote:
> >> >>>> Hi, thanks for your suggestions.
> >> >>>>
> >> >>>> Regarding replicate=2 vs replicate=1 performance: I expected that
> >> >>>> the configurations below would have similar performance:
> >> >>>> - single node, replicate = 1
> >> >>>> - two nodes, replicate = 2 (okay, this probably should be a bit
> >> >>>> slower due to additional overhead).
> >> >>>>
> >> >>>> However, what I'm seeing is that the second option (replicate=2)
> >> >>>> is about THREE times slower than the single node.
> >> >>>>
> >> >>>>
> >> >>>> Regarding replicate_on_write -- it is, in fact, a dangerous
> >> >>>> option. As the JIRA discusses, if you make changes to your ring
> >> >>>> (moving tokens and such) you will *silently* lose data. That is
> >> >>>> on top of whatever data you might end up losing if you run
> >> >>>> replicate_on_write=false and the only node that got the data
> >> >>>> fails.
> >> >>>>
> >> >>>> But what is much worse -- with replicate_on_write being false the
> >> >>>> data will NOT be replicated (in my tests) ever unless you
> >> >>>> explicitly request the cell. Then it will return the wrong
> >> >>>> result. And only on subsequent reads will it return adequate
> >> >>>> results. I haven't tested it, but the documentation states that
> >> >>>> a range query will NOT do 'read repair' and thus will not force
> >> >>>> replication.
> >> >>>> The test I did went like this:
> >> >>>> - replicate_on_write = false
> >> >>>> - write something to node A (which should in theory replicate to
> >> >>>> node B)
> >> >>>> - wait for a long time (longest was on the order of 5 hours)
> >> >>>> - read from node B (and here I was getting null / wrong result)
> >> >>>> - read from node B again (here you get what you'd expect after
> >> >>>> read repair)
> >> >>>>
> >> >>>> In essence, using replicate_on_write=false with rarely read data
> >> >>>> will practically defeat the purpose of having replication in the
> >> >>>> first place (failover, data redundancy).
> >> >>>>
> >> >>>>
> >> >>>> Or, in other words, this option doesn't look to be applicable to
> >> >>>> my situation.
> >> >>>>
> >> >>>> It looks like I will get much better performance by simply
> >> >>>> writing to two separate clusters rather than using a single
> >> >>>> cluster with replicate=2. Which is kind of stupid :) I think
> >> >>>> something's fishy with counters and replication.
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>> Edward Capriolo wrote
> >> >>>>> I misspoke, really. It is not dangerous; you just have to
> >> >>>>> understand what it means. This JIRA discusses it.
> >> >>>>>
> >> >>>>> https://issues.apache.org/jira/browse/CASSANDRA-3868
> >> >>>>>
> >> >>>>> On Tue, Nov 27, 2012 at 6:13 PM, Scott McKay <scottm@> wrote:
> >> >>>>>
> >> >>>>>> We're having a similar performance problem. Setting
> >> >>>>>> 'replicate_on_write: false' fixes the performance issue in our
> >> >>>>>> tests.
> >> >>>>>>
> >> >>>>>> How dangerous is it? What exactly could go wrong?
> >> >>>>>>
> >> >>>>>> On 12-11-27 01:44 PM, Edward Capriolo wrote:
> >> >>>>>>
> >> >>>>>> The difference between replication factor = 1 and replication
> >> >>>>>> factor > 1 is significant.
> >> >>>>>> Also, it sounds like your cluster is 2 nodes, so going from
> >> >>>>>> RF=1 to RF=2 means double the load on both nodes.
> >> >>>>>>
> >> >>>>>> You may want to experiment with the very dangerous column
> >> >>>>>> family attribute:
> >> >>>>>>
> >> >>>>>> - replicate_on_write: Replicate every counter update from the
> >> >>>>>> leader to the follower replicas. Accepts the values true and
> >> >>>>>> false.
> >> >>>>>>
> >> >>>>>> Edward
> >> >>>>>>
> >> >>>>>> On Tue, Nov 27, 2012 at 1:02 PM, Michael Kjellman <mkjellman@> wrote:
> >> >>>>>>
> >> >>>>>>> Are you writing with QUORUM consistency or ONE?
> >> >>>>>>>
> >> >>>>>>> On 11/27/12 9:52 AM, "Sergey Olefir" <solf.lists@> wrote:
> >> >>>>>>>
> >> >>>>>>> >Hi Juan,
> >> >>
> >> >> --
> >> >> View this message in context:
> >> >> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/counters-replication-awful-performance-tp7583993p7584014.html
> >> >> Sent from the cassandra-user@.apache mailing list archive at Nabble.com.
> >>
> >> --
> >> View this message in context:
> >> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/counters-replication-awful-performance-tp7583993p7584031.html
> >> Sent from the cassandra-user@.apache mailing list archive at Nabble.com.
>
> --
> View this message in context:
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/counters-replication-awful-performance-tp7583993p7584052.html
> Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com.
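
For anyone reading the archive: the batching idea discussed in this thread (avoid incrementing the same counter n times by summing the increments in memory first) can be sketched roughly as below. This is only an illustration in Python, not the Rainbird or IronCount code. IncrementBatcher, flush_fn and max_pending are hypothetical names, and flush_fn stands in for whatever actually sends the summed deltas to Cassandra.

```python
from collections import Counter
from typing import Callable, Dict, Hashable


class IncrementBatcher:
    """Coalesce repeated increments of the same counter key in memory.

    Hypothetical helper: flush_fn is an assumed callback that would issue
    one counter update per key (e.g. via a Cassandra client).
    """

    def __init__(self, flush_fn: Callable[[Dict[Hashable, int]], None],
                 max_pending: int = 1000) -> None:
        self._flush_fn = flush_fn
        self._max_pending = max_pending
        self._pending: Counter = Counter()
        self._events = 0  # increments buffered since the last flush

    def increment(self, key: Hashable, delta: int = 1) -> None:
        # Sum the delta locally instead of writing it straight away.
        self._pending[key] += delta
        self._events += 1
        if self._events >= self._max_pending:
            self.flush()

    def flush(self) -> None:
        # Hand over one summed delta per key, then reset the buffer.
        if not self._pending:
            return
        batch = dict(self._pending)
        self._pending.clear()
        self._events = 0
        self._flush_fn(batch)


# Usage: four incoming increments become a single batch with two keys.
writes = []
batcher = IncrementBatcher(writes.append, max_pending=4)
for page in ["home", "home", "about", "home"]:
    batcher.increment(page)
```

The trade-off is that increments still sitting in the buffer are lost if the process dies before a flush, so the batch size (or a flush timer) bounds how much can go missing.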