Re: Distinct Counter Proposal for Cassandra

Chris Burroughs Fri, 29 Jun 2012 06:31:44 -0700

Well I obviously think it would be handy.  If this get's proposed and
end's up using stream-lib don't be shy about asking for help.


On a more general note, it would be great to see the special case
Counter code become more general atomic operation code.

On 06/13/2012 01:15 PM, Utku Can Topçu wrote:
> Hi Yuki,
> 
> I think I should have used the word discussion instead of proposal for the
> mailing subject. I have quite some of a design in my mind but I think it's
> not yet ripe enough to formalize. I'll try to simplify it and open a Jira
> ticket.
> But first I'm wondering if there would be any excitement in the community
> for such a feature.
> 
> Regards,
> Utku
> 
> On Wed, Jun 13, 2012 at 7:00 PM, Yuki Morishita <mor.y...@gmail.com> wrote:
> 
>> You can open JIRA ticket at
>> https://issues.apache.org/jira/browse/CASSANDRA with your proposal.
>>
>> Just for the input:
>>
>> I had once implemented HyperLogLog counter to use internally in Cassandra,
>> but it turned out I didn't need it so I just put it to gist. You can find
>> it here: https://gist.github.com/2597943
>>
>> The above implementation and most of the other ones (including stream-lib)
>> implement the optimized version of the algorithm which counts up to 10^9,
>> so may need some work.
>>
>> Other alternative is self-learning bitmap (
>> http://ect.bell-labs.com/who/aychen/sbitmap4p.pdf) which, in my
>> understanding, is more memory efficient when counting small values.
>>
>> Yuki
>>
>> On Wednesday, June 13, 2012 at 11:28 AM, Utku Can Topçu wrote:
>>
>> Hi All,
>>
>> Let's assume we have a use case where we need to count the number of
>> columns for a given key. Let's say the key is the URL and the column-name
>> is the IP address or any cardinality identifier.
>>
>> The straight forward implementation seems to be simple, just inserting the
>> IP Adresses as columns under the key defined by the URL and using get_count
>> to count them back. However the problem here is in case of large rows
>> (where too many IP addresses are in); the get_count method has to
>> de-serialize the whole row and calculate the count. As also defined in the
>> user guides, it's not an O(1) operation and it's quite costly.
>>
>> However, this problem seems to have better solutions if you don't have a
>> strict requirement for the count to be exact. There are streaming
>> algorithms that will provide good cardinality estimations within a
>> predefined failure rate, I think the most popular one seems to be the
>> (Hyper)LogLog algorithm, also there's an optimal one developed recently,
>> please check http://dl.acm.org/citation.cfm?doid=1807085.1807094
>>
>> If you want to take a look at the Java implementation for LogLog,
>> Clearspring has both LogLog and space optimized HyperLogLog available at
>> https://github.com/clearspring/stream-lib
>>
>> I don't see a reason why this can't be implemented in Cassandra. The
>> distributed nature of all these algorithms can easily be adapted to
>> Cassandra's model. I think most of us would love to see come cardinality
>> estimating columns in Cassandra.
>>
>> Regards,
>> Utku
>>
>>
>>
>

Re: Distinct Counter Proposal for Cassandra

Reply via email to