Well I obviously think it would be handy. If this get's proposed and end's up using stream-lib don't be shy about asking for help.
On a more general note, it would be great to see the special case Counter code become more general atomic operation code. On 06/13/2012 01:15 PM, Utku Can Topçu wrote: > Hi Yuki, > > I think I should have used the word discussion instead of proposal for the > mailing subject. I have quite some of a design in my mind but I think it's > not yet ripe enough to formalize. I'll try to simplify it and open a Jira > ticket. > But first I'm wondering if there would be any excitement in the community > for such a feature. > > Regards, > Utku > > On Wed, Jun 13, 2012 at 7:00 PM, Yuki Morishita <mor.y...@gmail.com> wrote: > >> You can open JIRA ticket at >> https://issues.apache.org/jira/browse/CASSANDRA with your proposal. >> >> Just for the input: >> >> I had once implemented HyperLogLog counter to use internally in Cassandra, >> but it turned out I didn't need it so I just put it to gist. You can find >> it here: https://gist.github.com/2597943 >> >> The above implementation and most of the other ones (including stream-lib) >> implement the optimized version of the algorithm which counts up to 10^9, >> so may need some work. >> >> Other alternative is self-learning bitmap ( >> http://ect.bell-labs.com/who/aychen/sbitmap4p.pdf) which, in my >> understanding, is more memory efficient when counting small values. >> >> Yuki >> >> On Wednesday, June 13, 2012 at 11:28 AM, Utku Can Topçu wrote: >> >> Hi All, >> >> Let's assume we have a use case where we need to count the number of >> columns for a given key. Let's say the key is the URL and the column-name >> is the IP address or any cardinality identifier. >> >> The straight forward implementation seems to be simple, just inserting the >> IP Adresses as columns under the key defined by the URL and using get_count >> to count them back. However the problem here is in case of large rows >> (where too many IP addresses are in); the get_count method has to >> de-serialize the whole row and calculate the count. As also defined in the >> user guides, it's not an O(1) operation and it's quite costly. >> >> However, this problem seems to have better solutions if you don't have a >> strict requirement for the count to be exact. There are streaming >> algorithms that will provide good cardinality estimations within a >> predefined failure rate, I think the most popular one seems to be the >> (Hyper)LogLog algorithm, also there's an optimal one developed recently, >> please check http://dl.acm.org/citation.cfm?doid=1807085.1807094 >> >> If you want to take a look at the Java implementation for LogLog, >> Clearspring has both LogLog and space optimized HyperLogLog available at >> https://github.com/clearspring/stream-lib >> >> I don't see a reason why this can't be implemented in Cassandra. The >> distributed nature of all these algorithms can easily be adapted to >> Cassandra's model. I think most of us would love to see come cardinality >> estimating columns in Cassandra. >> >> Regards, >> Utku >> >> >> >