Whoops, sorry for the empty response; I'm new to e-mail. The register set within an HLL supports union directly, and intersection can be derived from it. You should be able to estimate cardinality without re-reading the data. In effect, you can segment your estimation and still keep the error to roughly 2% or less.
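For what it's worth, here is a minimal sketch of the idea in Python. It is a toy HyperLogLog, not stream-lib's implementation: the class name TinyHLL, the helper intersection_estimate, the SHA-1 based hash, and the default of 2^12 registers are all assumptions made for illustration. It shows that a union is just a register-wise max of two sketches, so segments can be merged without touching the raw data again, and that an intersection can be estimated by inclusion-exclusion.

```python
import hashlib
import math


class TinyHLL:
    """Toy HyperLogLog sketch (illustrative; not stream-lib's implementation)."""

    def __init__(self, p=12):
        self.p = p                       # number of bits used to pick a register
        self.m = 1 << p                  # number of registers
        self.registers = [0] * self.m

    def add(self, value):
        # Assumption: a 64-bit hash taken from the first 8 bytes of SHA-1.
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                       # top p bits select the register
        rest = h & ((1 << (64 - self.p)) - 1)          # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1   # position of the leftmost 1-bit
        self.registers[idx] = max(self.registers[idx], rank)

    def union(self, other):
        # Merging two sketches is a register-wise max; no raw data is needed.
        merged = TinyHLL(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def cardinality(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)               # bias constant, valid for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            return m * math.log(m / zeros)             # linear counting for small sets
        return raw


def intersection_estimate(a, b):
    # Inclusion-exclusion: |A| + |B| - |A UNION B|
    return a.cardinality() + b.cardinality() - a.union(b).cardinality()


# Usage: one sketch per field (or per time segment), merged on demand.
field_1, field_2 = TinyHLL(), TinyHLL()
for name in ["John", "Jake", "John", "Mary"]:
    field_1.add(name)
for name in ["Mary", "Anne"]:
    field_2.add(name)
print(field_1.cardinality(), intersection_estimate(field_1, field_2))
```

One caveat on the intersection: its error scales with the size of the union rather than the size of the intersection, so a small overlap between two large sets can be estimated much less accurately than either set on its own.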
Union is straightforward, whereas intersection is |FIELD_1| + |FIELD_2| - |FIELD_1 UNION FIELD_2|.

On Fri, May 16, 2014 at 9:17 PM, Marc Parisi <[email protected]> wrote:

>
> On Fri, May 16, 2014 at 6:04 PM, Corey Nolet <[email protected]> wrote:
>
>> What's the expected size of your unique key set? Thousands? Millions?
>> Billions?
>>
>> You could probably use a table structure similar to
>> https://github.com/calrissian/accumulo-recipes/tree/master/store/metrics-store
>> but just have it emit 1's instead of summing them.
>>
>> I'm thinking maybe your mappings could be like this:
>> group=anything, type=NAME, name=John(etc...)
>>
>> perhaps a ColumnQualifierGrouping iterator could be applied at scan time
>> to add up the cardinalities for the quals over the given time range being
>> scanned where cardinalities across different time units get aggregated
>> client side.
>>
>>
>> On Fri, May 16, 2014 at 5:19 PM, David Medinets <[email protected]> wrote:
>>
>>> Yes, the data has not yet been ingested. I can control the table
>>> structure; hopefully by integrating (or extending) the D4M schema.
>>>
>>> I'm leaning towards using https://github.com/addthis/stream-lib as part
>>> of the ingest process. Upon start up, existing tables would be analyzed to
>>> find cardinality. Then as records are ingested, the cardinality would be
>>> adjusted as needed. I don't yet know how to store the cardinality
>>> information so that restarting the ingest process doesn't require
>>> re-processing all the data. Still researching.
>>>
>>>
>>> On Fri, May 16, 2014 at 4:19 PM, Corey Nolet <[email protected]> wrote:
>>>
>>>> Can we assume this data has not yet been ingested? Do you have control
>>>> over the way in which you structure your table?
>>>>
>>>>
>>>> On Fri, May 16, 2014 at 1:54 PM, David Medinets <[email protected]> wrote:
>>>>
>>>>> If I have the following simple set of data:
>>>>>
>>>>> NAME John
>>>>> NAME Jake
>>>>> NAME John
>>>>> NAME Mary
>>>>>
>>>>> I want to end up with the following:
>>>>>
>>>>> NAME 3
>>>>>
>>>>> I'm thinking that perhaps a HyperLogLog approach should work. See
>>>>> http://en.wikipedia.org/wiki/HyperLogLog for more information.
>>>>>
>>>>> Has anyone done this before in Accumulo?
