Re: rainbird question (why is the 1minute buffer needed?)

Yang Mon, 23 May 2011 12:06:46 -0700

Thanks Ryan,

Could you please share more details? From what you observed in
testing, why was performance worse without the extra buffering?


I was thinking (I could be wrong) that without the extra buffering, each
counter update goes to Memtable.putIfPresent() and
CounterColumn.resolve(), which are still in-memory operations, and so
should not be that expensive?
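
To make the comparison concrete, here is a minimal sketch of what collapsing +1 deltas before they reach the cluster looks like. It is an illustration only, not Rainbird's actual code: `CollapsingBuffer`, `send`, and `demo` are hypothetical names, and `send` merely stands in for one counter write submitted to Cassandra.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple


class CollapsingBuffer:
    """Hypothetical client-side buffer that sums increments per key."""

    def __init__(self, send: Callable[[str, int], None]) -> None:
        # `send` is an assumed callback representing one counter write.
        self._send = send
        self._pending: Counter = Counter()

    def increment(self, key: str, delta: int = 1) -> None:
        # Collapse in local memory; nothing is sent to the cluster yet.
        self._pending[key] += delta

    def flush(self) -> int:
        # Send one write per distinct key, carrying the summed delta.
        pending, self._pending = self._pending, Counter()
        for key, delta in pending.items():
            self._send(key, delta)
        return len(pending)


def demo() -> Dict[str, int]:
    writes: List[Tuple[str, int]] = []
    buf = CollapsingBuffer(lambda key, delta: writes.append((key, delta)))
    for _ in range(1000):
        buf.increment("example.com")
    for _ in range(250):
        buf.increment("example.com/path")
    sent = buf.flush()
    return {"events": 1250, "writes": sent, "total": sum(d for _, d in writes)}
```

In this sketch, 1250 events over two keys result in two calls to `send` when `flush()` runs. Anything still held in `_pending` is gone if the process dies before `flush()`, which is the up-to-1-minute exposure discussed in this thread.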


Yang

On Mon, May 23, 2011 at 11:54 AM, Ryan King <r...@twitter.com> wrote:
> On Sun, May 22, 2011 at 11:00 AM, Yang <teddyyyy...@gmail.com> wrote:
>> Thanks,
>>
>> I did read through that pdf doc and went through the counters code in
>> 0.8-rc2. I think I understand the logic in that code.
>>
>> In my hypothetical implementation, I am not suggesting that we bypass the
>> complicated logic in the counters code: the extra module would still
>> need to submit the increment through
>> StorageProxy.mutate( My_counter.delta=1 ), so the logical clock is still
>> handled by the counters code.
>>
>> The only difference is, as you said, that rainbird collapses many +1
>> deltas. But my claim is that this "collapsing" is already done by
>> Cassandra, since a write always hits the memtable first. So collapsing in
>> the Cassandra memtable vs. collapsing in rainbird memory takes the same
>> time, while rainbird introduces an extra level of caching. (I strongly
>> suspect that rainbird is vulnerable to losing up to 1 minute's worth of
>> data if it dies before the writes are flushed to Cassandra, unless it
>> implements its own commit log, but that would mean re-implementing many
>> of the wheels in Cassandra ....)
>
> Right, Rainbird buffers for performance and can lose up to 1 minute of data.
>
>> At one time I thought the reason was probably that, from one given url,
>> rainbird needs to create writes on many keys, and those keys need to go
>> to different Cassandra nodes. But later I found that this can also be
>> done in a module on the coordinator, since the client request first hits
>> a coordinator rather than the data node. In fact, in a multi-insert
>> case, the coordinator already sends the request to multiple data
>> nodes. The extra module I am proposing simply translates a single
>> insert into a multi-insert, and then Cassandra takes over from there.
>>
>>
>> Thanks
>> Yang
>>
>> On Sun, May 22, 2011 at 3:47 AM, aaron morton <aa...@thelastpickle.com> wrote:
>>> The implementation of distributed counters is more complicated than your
>>> example; there is a design doc attached to the ticket here:
>>> https://issues.apache.org/jira/browse/CASSANDRA-1072
>>> By collapsing some of those +1 increments together at the application level
>>> there is less work for the cluster to do. This can be important when the
>>> numbers are big: http://blog.twitter.com/2011/03/numbers.html
>>> Cheers
>>> -----------------
>>> Aaron Morton
>>> Freelance Cassandra Developer
>>> @aaronmorton
>>> http://www.thelastpickle.com
>>> On 21 May 2011, at 09:04, Yang wrote:
>>>
>>> (Sorry if Rainbird is not a relevant enough topic; I'd appreciate it if
>>> someone could point me to a more appropriate venue in that case.)
>>>
>>>
>>> Rainbird buffers up 1 minute worth of events first before writing to
>>> Cassandra.
>>>
>>> It seems that this extra layer of buffering is redundant and could
>>> be avoided: Cassandra's memtable already does buffering, whose
>>> internal implementation is just
>>> Map.put(key, CF). I guess rainbird does something similar:
>>> column_to_count = map.get(key); column_to_count++ ; map.put(key,
>>> column_to_count) ??
>>> The "++" part is probably already done by the Distributed Counters in
>>> Cassandra.
>>> Then I guess the Rainbird layer exists because it needs to parse an
>>> incoming event into the various attributes it is interested in: for
>>> example, from a url, we bump up the counts of
>>> FQDN, domain, path etc.; Rainbird does the transformation from
>>> url--->3 attrs.
>>>
>>> But I guess that transformation might as well be done in the Cassandra
>>> JVM itself, if we could provide some hooks so that a module
>>> translates an incoming request into
>>> multiple keys and bumps up their counts. That way we avoid the
>>> intermediate communication from clients to rainbird, and from rainbird to
>>> Cassandra. Are there some points I'm missing?
>>>
>>> Thanks
>>> Yang
>>>
>>>
>>
>
