Re: Tracking cardinality in Accumulo

2014-05-17 Thread David Medinets
> What's the expected size of your unique key set? Thousands? Millions? Billions?

This project is something to occupy me in my spare time. And it's intended to explore aspects of Accumulo that I haven't needed to use yet. In the past, I simply ran a map-reduce job using the Word Counting technique. t

Re: Tracking cardinality in Accumulo

2014-05-16 Thread Marc Parisi
On Fri, May 16, 2014 at 6:04 PM, Corey Nolet wrote:
> What's the expected size of your unique key set? Thousands? Millions? Billions?
>
> You could probably use a table structure similar to
> https://github.com/calrissian/accumulo-recipes/tree/master/store/metrics-store but
> just have it emit

Re: Tracking cardinality in Accumulo

2014-05-16 Thread Marc Parisi
Whoops, sorry for the empty response, but I'm new to e-mail. The bitset within HLL supports union and intersection. You should be able to estimate cardinality without re-reading the data. In effect, you can segment your estimation and keep the error below about 2%. Union is straightforward, whereas int
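
The union property described above can be illustrated with a small, self-contained sketch. This is not stream-lib's implementation or API: the `HyperLogLog` class, its method names, the SHA-1-based 64-bit hash, and the choice of 2^12 registers are all assumptions made for illustration. Merging takes the per-register maximum, which is why sketches built over separate segments of data can be combined without re-reading the data.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog sketch with 2**p registers (illustrative only)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p              # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        # Assumed hash: the first 64 bits of SHA-1 over the UTF-8 string.
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                  # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)     # remaining 64 - p bits
        # Rank: 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        """Union of two sketches: register-wise maximum."""
        if self.p != other.p:
            raise ValueError("sketches must use the same precision")
        merged = HyperLogLog(self.p)
        merged.registers = [max(x, y) for x, y in zip(self.registers, other.registers)]
        return merged

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)     # bias constant for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


# Two segments that overlap in 5000 values; the union has 15000 distinct values.
a, b = HyperLogLog(), HyperLogLog()
for i in range(10000):
    a.add(f"user-{i}")
for i in range(5000, 15000):
    b.add(f"user-{i}")
print(round(a.merge(b).count()))  # an estimate of the 15000 distinct values
```

With 4096 registers the expected standard error is roughly 1.04 / sqrt(4096), or about 1.6%, which is in line with the "about 2%" figure mentioned above. The merged sketch has exactly the same registers as one sketch fed every value directly.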

Re: Tracking cardinality in Accumulo

2014-05-16 Thread Corey Nolet
What's the expected size of your unique key set? Thousands? Millions? Billions? You could probably use a table structure similar to https://github.com/calrissian/accumulo-recipes/tree/master/store/metrics-store but just have it emit 1's instead of summing them. I'm thinking maybe your mappings cou
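
The idea of emitting 1's instead of summing can be sketched as follows. The message is cut off, so the key layout here is a guess rather than the layout Corey proposed or the metrics-store schema: a plain dict stands in for the table, the key is a hypothetical (field, value) pair, and `ingest` and `cardinality` are illustrative names. Because every write for the same key stores the constant 1, duplicates collapse, and the exact cardinality of a field is the number of keys under it.

```python
def ingest(table, field, value):
    # Writing the constant 1 is idempotent: a repeated (field, value)
    # overwrites the same key instead of adding to a running sum.
    table[(field, value)] = 1


def cardinality(table, field):
    # Sum the 1's stored under the given field.
    return sum(v for (row, _), v in table.items() if row == field)


table = {}
for field, value in [("NAME", "John"), ("NAME", "Jake"),
                     ("NAME", "John"), ("NAME", "Mary")]:
    ingest(table, field, value)

print(cardinality(table, "NAME"))  # 3
```

This gives an exact answer at the cost of one key per distinct value, which is why the expected size of the unique key set matters.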

Re: Tracking cardinality in Accumulo

2014-05-16 Thread David Medinets
Yes, the data has not yet been ingested. I can control the table structure, hopefully by integrating (or extending) the D4M schema. I'm leaning towards using https://github.com/addthis/stream-lib as part of the ingest process. On startup, existing tables would be analyzed to find cardinality. T

Re: Tracking cardinality in Accumulo

2014-05-16 Thread Corey Nolet
Can we assume this data has not yet been ingested? Do you have control over the way in which you structure your table?

On Fri, May 16, 2014 at 1:54 PM, David Medinets wrote:
> If I have the following simple set of data:
>
> NAME John
> NAME Jake
> NAME John
> NAME Mary
>
> I want to end up with

Re: Tracking cardinality in Accumulo

2014-05-16 Thread William Slacum
Yes. It will be less useful if you can't scan only the newest data, as you'll be recombining the same pieces of data on subsequent runs.

On Fri, May 16, 2014 at 1:54 PM, David Medinets wrote:
> If I have the following simple set of data:
>
> NAME John
> NAME Jake
> NAME John
> NAME Mary
>
> I wa

Tracking cardinality in Accumulo

2014-05-16 Thread David Medinets
If I have the following simple set of data:

NAME John
NAME Jake
NAME John
NAME Mary

I want to end up with the following:

NAME 3

I'm thinking that perhaps a HyperLogLog approach should work. See http://en.wikipedia.org/wiki/HyperLogLog for more information. Has anyone done this before in Accumu