Re: counters + replication = awful performance?

Edward Capriolo Wed, 28 Nov 2012 11:58:37 -0800

Just for reference, HBase's counters also do a local read. I am not saying
they work better/worse/faster/slower, but I would not expect any system
that reads on increment to be significantly faster than what Cassandra
does.
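
To make the read-on-increment point concrete, here is a minimal sketch in
Python. It is not how Cassandra or HBase actually implement counters: the
class name ReadOnIncrementCounter and its in-memory dict are assumptions
made purely for illustration. It only shows that every increment costs one
read in addition to the write.

```python
class ReadOnIncrementCounter:
    """Toy counter store (hypothetical): every increment reads first."""

    def __init__(self):
        self._values = {}  # stands in for cached or on-disk storage
        self.reads = 0     # number of reads performed so far
        self.writes = 0    # number of writes performed so far

    def _read(self, key):
        self.reads += 1
        return self._values.get(key, 0)

    def _write(self, key, value):
        self.writes += 1
        self._values[key] = value

    def increment(self, key, delta=1):
        # Read-modify-write: the read cannot be skipped, so it bounds
        # how fast increments can go.
        new_value = self._read(key) + delta
        self._write(key, new_value)
        return new_value


counter = ReadOnIncrementCounter()
for _ in range(1000):
    counter.increment("page_views")
print(counter.reads, counter.writes)
```

The read and write tallies grow in lockstep, so whatever a read costs
(cache hit or disk seek) is paid on every increment.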


I am just saying your counter throughput is read bound; this is not unique
to C*'s implementation.
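
As a rough illustration of what "read bound" means here, the sketch below
models the cost of one counter increment following Sylvain's description
quoted further down: without replication only the local write happens; with
replication the first replica writes, then reads, then ships the result to
the other replicas, one step after another. The function names are
hypothetical and every latency figure is a made-up placeholder, not a
measurement from any cluster.

```python
def increment_latency_ms(replicated, write_ms, read_ms, replicate_ms):
    """Latency of one increment under a simple sequential model (assumed)."""
    if not replicated:
        # RF=1: no read is needed because there is nothing to replicate.
        return write_ms
    # RF>1: local write, then a read, then replication, done in sequence.
    return write_ms + read_ms + replicate_ms


def increments_per_second(replicated, write_ms, read_ms, replicate_ms):
    """Throughput of a single serial writer; real clients run concurrently."""
    return 1000.0 / increment_latency_ms(
        replicated, write_ms, read_ms, replicate_ms
    )


# Placeholder numbers only: compare a cached read with a read from disk.
for label, read_ms in (("row cache hit", 0.05), ("disk read", 5.0)):
    rf1 = increments_per_second(False, 0.1, read_ms, 0.5)
    rf2 = increments_per_second(True, 0.1, read_ms, 0.5)
    print(f"{label}: RF=1 {rf1:.0f}/s, RF>1 {rf2:.0f}/s per serial writer")
```

In this model the read term dominates as soon as it hits disk, which is why
the row cache is the first thing worth trying.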



On Wed, Nov 28, 2012 at 2:41 PM, Sergey Olefir <solf.li...@gmail.com> wrote:

> Well, that is sad news then. I don't think I can consider 20k increments
> per second for a two-node cluster (with RF=2) reasonable performance
> (cost vs. benefit).
>
> I might have to look into other storage solutions or perhaps experiment
> with duplicate clusters with RF=1 or replicate_on_write=false.
>
> Although yes, I probably should try that row cache you mentioned -- I saw
> that key cache was going unused (so saw no reason to try to enable row
> cache), but I think it was on RF=1, it might be different on RF=2.
>
>
> Sylvain Lebresne-3 wrote
> > Counter replication works differently from that of "normal" writes.
> > Namely, a counter update is written to a first replica, then a read is
> > performed and the result of that is replicated to the other nodes. With
> > RF=1, since there is only one replica, no read is involved, but in a way
> > it's a degenerate case. So there are two reasons why RF>1 is much slower
> > than RF=1:
> > 1) it involves a read to replicate and that read takes time. Especially
> > if that read hits the disk, it may even dominate the insertion time.
> > 2) the replication to the first replica and the one to the rest of the
> > replicas are not done in parallel but sequentially. Note that this is
> > only true for the first replica versus the others. In other words, from
> > RF=2 to RF=3 you should not see a significant performance degradation.
> >
> > Note that while there is nothing you can do for 2), you can try to speed
> > up 1) by using row cache for instance (in case you weren't).
> >
> > In other words, with counters, it is expected that RF=1 be potentially
> > much faster than RF>1. That is the way counters work.
> >
> > And don't get me wrong, I'm not suggesting you should use RF=1 at all.
> > What I am saying is that the performance you see with RF=2 is the
> > performance of counters in Cassandra.
> >
> > --
> > Sylvain
> >
> >
> > On Wed, Nov 28, 2012 at 7:34 AM, Sergey Olefir <solf.lists@> wrote:
> >
> >> I think there might be a misunderstanding as to the nature of the
> >> problem.
> >>
> >> Say, I have test set T. And I have two identical servers A and B.
> >> - I tested that server A (singly) is able to handle load of T.
> >> - I tested that server B (singly) is able to handle load of T.
> >> - I then join A and B in the cluster and set replication=2 -- this means
> >> that each server in effect has to handle full test load individually
> >> (because there are two servers and replication=2 it means that each
> >> server effectively has to handle all the data written to the cluster).
> >> Under these circumstances it is reasonable to assume that cluster A+B
> >> shall be able to handle load T because each server is able to do so
> >> individually.
> >>
> >> HOWEVER, this is not the case. In fact, A+B together are only able to
> >> handle less than 1/3 of T DESPITE the fact that A and B individually are
> >> able to handle T just fine.
> >>
> >> I think there's something wrong with Cassandra replication (possibly as
> >> simple as me misconfiguring something) -- it shouldn't be three times
> >> faster to write to two separate nodes in parallel as compared to writing
> >> to a 2-node Cassandra cluster with replication=2.
> >>
> >>
> >> Edward Capriolo wrote
> >> > Say you are doing 100 inserts at rf 1 on two nodes. That is 50 inserts
> >> > a node. If you go to rf 2, that is 100 inserts a node. If you were at
> >> > 75% capacity on each node, you're now at 150%, which is not possible,
> >> > so things bog down.
> >> >
> >> > To figure out what is going on we would need to see tpstats, iostat,
> >> > and top information.
> >> >
> >> > I think you're looking at the performance the wrong way. Starting off
> >> > at rf 1 is not the way to understand Cassandra performance.
> >> >
> >> > You do not get the benefits of "scale out" until you fix your rf and
> >> > increase your node count. I.e. 5 nodes at rf 3 is fast, 10 nodes at
> >> > rf 3 even better.
> >> > On Tuesday, November 27, 2012, Sergey Olefir <solf.lists@> wrote:
> >> >> I already do a lot of in-memory aggregation before writing to
> >> >> Cassandra.
> >> >>
> >> >> The question here is what is wrong with Cassandra (or its
> >> >> configuration) that causes a huge performance drop when moving from
> >> >> 1-replication to 2-replication for counters -- and more importantly
> >> >> how to resolve the problem. A 2x-3x drop when moving from
> >> >> 1-replication to 2-replication on two nodes is reasonable. 6x is not.
> >> >> Like I said, with this kind of performance degradation it makes more
> >> >> sense to run two clusters with replication=1 in parallel rather than
> >> >> rely on Cassandra replication.
> >> >>
> >> >> And yes, Rainbird was the inspiration for what we are trying to do
> >> >> here :)
> >> >>
> >> >>
> >> >>
> >> >> Edward Capriolo wrote
> >> >>> Cassandra's counters read on increment. Additionally they are
> >> >>> distributed, so there can be multiple reads on increment. If they are
> >> >>> not fast enough and you have avoided all tuning options, add more
> >> >>> servers to handle the load.
> >> >>>
> >> >>> In many cases incrementing the same counter n times can be avoided.
> >> >>>
> >> >>> Twitter's Rainbird did just that. It avoided multiple counter
> >> >>> increments by batching them.
> >> >>>
> >> >>> I have done a similar thing using Cassandra and Kafka.
> >> >>>
> >> >>>
> >> >>> https://github.com/edwardcapriolo/IronCount/blob/master/src/test/java/com/jointhegrid/ironcount/mockingbird/MockingBirdMessageHandler.java
> >> >>>
> >> >>>
> >> >>> On Tuesday, November 27, 2012, Sergey Olefir <solf.lists@> wrote:
> >> >>>> Hi, thanks for your suggestions.
> >> >>>>
> >> >>>> Regarding replicate=2 vs replicate=1 performance: I expected that the
> >> >>>> configurations below would have similar performance:
> >> >>>> - single node, replicate = 1
> >> >>>> - two nodes, replicate = 2 (okay, this probably should be a bit
> >> >>>> slower due to additional overhead).
> >> >>>>
> >> >>>> However, what I'm seeing is that the second option (replicate=2) is
> >> >>>> about THREE times slower than single node.
> >> >>>>
> >> >>>>
> >> >>>> Regarding replicate_on_write -- it is, in fact, a dangerous option.
> >> >>>> As the JIRA discusses, if you make changes to your ring (moving
> >> >>>> tokens and such) you will *silently* lose data. That is on top of
> >> >>>> whatever data you might end up losing if you run
> >> >>>> replicate_on_write=false and the only node that got the data fails.
> >> >>>>
> >> >>>> But what is much worse -- with replicate_on_write being false the
> >> >>>> data will NOT be replicated (in my tests) ever unless you explicitly
> >> >>>> request the cell. Then it will return the wrong result. And only on
> >> >>>> subsequent reads will it return adequate results. I haven't tested
> >> >>>> it, but the documentation states that a range query will NOT do
> >> >>>> 'read repair' and thus will not force replication.
> >> >>>> The test I did went like this:
> >> >>>> - replicate_on_write = false
> >> >>>> - write something to node A (which should in theory replicate to
> >> >>>> node B)
> >> >>>> - wait for a long time (longest was on the order of 5 hours)
> >> >>>> - read from node B (and here I was getting null / wrong result)
> >> >>>> - read from node B again (here you get what you'd expect after read
> >> >>>> repair)
> >> >>>>
> >> >>>> In essence, using replicate_on_write=false with rarely read data
> >> >>>> will practically defeat the purpose of having replication in the
> >> >>>> first place (failover, data redundancy).
> >> >>>>
> >> >>>>
> >> >>>> Or, in other words, this option doesn't look to be applicable to my
> >> >>>> situation.
> >> >>>>
> >> >>>> It looks like I will get much better performance by simply writing
> >> >>>> to two separate clusters rather than using a single cluster with
> >> >>>> replicate=2. Which is kind of stupid :) I think something's fishy
> >> >>>> with counters and replication.
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>> Edward Capriolo wrote
> >> >>>>> I misspoke really. It is not dangerous; you just have to understand
> >> >>>>> what it means. This JIRA discusses it.
> >> >>>>>
> >> >>>>> https://issues.apache.org/jira/browse/CASSANDRA-3868
> >> >>>>>
> >> >>>>> On Tue, Nov 27, 2012 at 6:13 PM, Scott McKay <scottm@> wrote:
> >> >>>>>
> >> >>>>>> We're having a similar performance problem. Setting
> >> >>>>>> 'replicate_on_write: false' fixes the performance issue in our tests.
> >> >>>>>>
> >> >>>>>> How dangerous is it?  What exactly could go wrong?
> >> >>>>>>
> >> >>>>>> On 12-11-27 01:44 PM, Edward Capriolo wrote:
> >> >>>>>>
> >> >>>>>> The difference between replication factor = 1 and replication
> >> >>>>>> factor > 1 is significant. Also it sounds like your cluster is 2
> >> >>>>>> nodes, so going from RF=1 to RF=2 means double the load on both
> >> >>>>>> nodes.
> >> >>>>>>
> >> >>>>>> You may want to experiment with the very dangerous column family
> >> >>>>>> attribute:
> >> >>>>>>
> >> >>>>>> - replicate_on_write: Replicate every counter update from the
> >> >>>>>> leader to the follower replicas. Accepts the values true and false.
> >> >>>>>>
> >> >>>>>>  Edward
> >> >>>>>> On Tue, Nov 27, 2012 at 1:02 PM, Michael Kjellman <mkjellman@> wrote:
> >> >>>>>>
> >> >>>>>>> Are you writing with QUORUM consistency or ONE?
> >> >>>>>>>
> >> >>>>>>> On 11/27/12 9:52 AM, "Sergey Olefir" <solf.lists@> wrote:
> >> >>>>>>>
> >> >>>>>>> >Hi Juan,
> >> >>
> >> >> --
> >> >> View this message in context:
> >> >> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/counters-replication-awful-performance-tp7583993p7584014.html
> >> >> Sent from the cassandra-user@.apache mailing list archive at Nabble.com.
> >> >>
> >>
> >>
> >>
> >>
> >>
> >> --
> >> View this message in context:
> >> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/counters-replication-awful-performance-tp7583993p7584031.html
> >> Sent from the cassandra-user@.apache mailing list archive at Nabble.com.
> >>
>
>
>
>
>
> --
> View this message in context:
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/counters-replication-awful-performance-tp7583993p7584052.html
> Sent from the cassandra-u...@incubator.apache.org mailing list archive at
> Nabble.com.
>
