Hi Yuki, I think I should have used the word discussion instead of proposal for the mailing subject. I have quite some of a design in my mind but I think it's not yet ripe enough to formalize. I'll try to simplify it and open a Jira ticket. But first I'm wondering if there would be any excitement in the community for such a feature.
Regards, Utku On Wed, Jun 13, 2012 at 7:00 PM, Yuki Morishita <mor.y...@gmail.com> wrote: > You can open JIRA ticket at > https://issues.apache.org/jira/browse/CASSANDRA with your proposal. > > Just for the input: > > I had once implemented HyperLogLog counter to use internally in Cassandra, > but it turned out I didn't need it so I just put it to gist. You can find > it here: https://gist.github.com/2597943 > > The above implementation and most of the other ones (including stream-lib) > implement the optimized version of the algorithm which counts up to 10^9, > so may need some work. > > Other alternative is self-learning bitmap ( > http://ect.bell-labs.com/who/aychen/sbitmap4p.pdf) which, in my > understanding, is more memory efficient when counting small values. > > Yuki > > On Wednesday, June 13, 2012 at 11:28 AM, Utku Can Topçu wrote: > > Hi All, > > Let's assume we have a use case where we need to count the number of > columns for a given key. Let's say the key is the URL and the column-name > is the IP address or any cardinality identifier. > > The straight forward implementation seems to be simple, just inserting the > IP Adresses as columns under the key defined by the URL and using get_count > to count them back. However the problem here is in case of large rows > (where too many IP addresses are in); the get_count method has to > de-serialize the whole row and calculate the count. As also defined in the > user guides, it's not an O(1) operation and it's quite costly. > > However, this problem seems to have better solutions if you don't have a > strict requirement for the count to be exact. There are streaming > algorithms that will provide good cardinality estimations within a > predefined failure rate, I think the most popular one seems to be the > (Hyper)LogLog algorithm, also there's an optimal one developed recently, > please check http://dl.acm.org/citation.cfm?doid=1807085.1807094 > > If you want to take a look at the Java implementation for LogLog, > Clearspring has both LogLog and space optimized HyperLogLog available at > https://github.com/clearspring/stream-lib > > I don't see a reason why this can't be implemented in Cassandra. The > distributed nature of all these algorithms can easily be adapted to > Cassandra's model. I think most of us would love to see come cardinality > estimating columns in Cassandra. > > Regards, > Utku > > >