Re: CountMinSketch with a different function

Matt Abrams Thu, 10 Apr 2014 07:13:13 -0700

If you'd like to share a sample of your data, I'd be happy to build a
template Hydra job with the tree Ian described.


On Thu, Apr 10, 2014 at 9:53 AM, Ian Barfield <[email protected]> wrote:
> While we are talking about CMS though, I'll mention that I suggest using the
> ConservativeAdd variant I created a little while back. It is about an order
> of magnitude more accurate in many cases, in exchange for a few more integer
> comparisons.
>
> I suspect the extra CPU use is insignificant and could be regained in any
> event by using a smaller CMS - which you can afford to do with the increased
> accuracy. The only downside is that the theoretical accuracy is not well
> understood other than "better than standard" and experimentally "much
> better", so it uses the same sizing calculations as the base version by
> default.
>
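To make the conservative-add idea above concrete, here is a minimal, self-contained sketch. It is not stream-lib's implementation: the class name TinySketch, the hash mixing, and the table sizes are all assumptions picked for illustration. The only difference between the two update methods is that the conservative one raises each counter just far enough to cover the new estimate.

```java
import java.util.Random;

/** Illustrative count-min sketch; NOT stream-lib's implementation. */
class TinySketch {
    private final long[][] table; // depth rows of width counters
    private final long[] seeds;   // one hash seed per row
    private final int width;

    TinySketch(int depth, int width, long seed) {
        this.table = new long[depth][width];
        this.seeds = new long[depth];
        this.width = width;
        Random random = new Random(seed);
        for (int row = 0; row < depth; row++) {
            seeds[row] = random.nextLong();
        }
    }

    // Hypothetical hash mixing, chosen only to keep the example short.
    private int bucket(String key, int row) {
        long h = (key.hashCode() ^ seeds[row]) * 0x9E3779B97F4A7C15L;
        h ^= h >>> 29;
        return (int) Math.floorMod(h, (long) width);
    }

    /** Standard update: add the count to one counter in every row. */
    void add(String key, long count) {
        for (int row = 0; row < table.length; row++) {
            table[row][bucket(key, row)] += count;
        }
    }

    /** Conservative update: raise each counter only as far as the new estimate requires. */
    void addConservative(String key, long count) {
        long target = estimate(key) + count;
        for (int row = 0; row < table.length; row++) {
            int col = bucket(key, row);
            table[row][col] = Math.max(table[row][col], target);
        }
    }

    /** Point query: the minimum counter over all rows. */
    long estimate(String key) {
        long min = Long.MAX_VALUE;
        for (int row = 0; row < table.length; row++) {
            min = Math.min(min, table[row][bucket(key, row)]);
        }
        return min;
    }

    public static void main(String[] args) {
        // Deliberately undersized tables so that collisions are visible.
        TinySketch plain = new TinySketch(4, 16, 42L);
        TinySketch conservative = new TinySketch(4, 16, 42L);
        Random random = new Random(7L);
        for (int i = 0; i < 1000; i++) {
            String key = "movie-" + random.nextInt(100);
            plain.add(key, 1);
            conservative.addConservative(key, 1);
        }
        System.out.println("plain:        " + plain.estimate("movie-1"));
        System.out.println("conservative: " + conservative.estimate("movie-1"));
    }
}
```

With non-negative counts, neither variant underestimates, and with identical seeds the conservative estimate is never above the plain one. The extra cost is one point query plus one comparison per row on each update.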
> On Apr 10, 2014 9:43 AM, "Ian Barfield" <[email protected]> wrote:
>>
>> My point about the keys is that you cannot easily find which movie is
>> trending in which category. You'd have to do a look up of every movie title
>> in order to get their scores.
>>
>> For global counts, unless a movie changes categories randomly... The
>> counts should be the same for that key in every category (+/- estimation
>> accuracy). You can get the same result by just querying a single category
>> that movie belongs to.
>>
>> It sounds like what you are saying is that you want to merge the
>> categories and then do an iterative lookup over a bunch of keys. CMS is more
>> useful in the case where the cardinality of keys is huge, or the number of
>> CMS objects you need is very large. The easiest way to get what you want
>> though may be to keep a 'master' CMS - e.g., a fake category that every movie
>> belongs to. TopK is probably more appropriate though.
>>
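The 'master' category plus per-category topK suggestion above could be sketched roughly as follows. This is a toy Space-Saving counter, not stream-lib's StreamSummary; TinyTopK, GenreTrends, and recordView are hypothetical names, and the linear scan for the smallest counter is an assumption made for brevity rather than speed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative Space-Saving counter; NOT stream-lib's StreamSummary. */
class TinyTopK {
    private final int capacity;
    private final Map<String, Long> counts = new HashMap<>();

    TinyTopK(int capacity) {
        if (capacity < 1) throw new IllegalArgumentException("capacity must be >= 1");
        this.capacity = capacity;
    }

    void offer(String key) {
        Long current = counts.get(key);
        if (current != null) {
            counts.put(key, current + 1);
        } else if (counts.size() < capacity) {
            counts.put(key, 1L);
        } else {
            // Evict the smallest counter; the newcomer inherits its count plus one.
            String victim = null;
            long min = Long.MAX_VALUE;
            for (Map.Entry<String, Long> e : counts.entrySet()) {
                if (e.getValue() < min) { min = e.getValue(); victim = e.getKey(); }
            }
            counts.remove(victim);
            counts.put(key, min + 1);
        }
    }

    long count(String key) { return counts.getOrDefault(key, 0L); }

    /** Up to k keys, highest count first; ties broken by key for determinism. */
    List<String> top(int k) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Long.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < Math.min(k, entries.size()); i++) keys.add(entries.get(i).getKey());
        return keys;
    }
}

/** Hypothetical wrapper: one master counter plus one counter per genre. */
class GenreTrends {
    final TinyTopK master;
    final Map<String, TinyTopK> byGenre = new HashMap<>();
    private final int capacity;

    GenreTrends(int capacity) {
        this.capacity = capacity;
        this.master = new TinyTopK(capacity);
    }

    void recordView(String title, List<String> genres) {
        master.offer(title); // once per view, so the global count is not inflated
        for (String genre : genres) {
            byGenre.computeIfAbsent(genre, g -> new TinyTopK(capacity)).offer(title);
        }
    }

    public static void main(String[] args) {
        GenreTrends trends = new GenreTrends(10);
        for (int i = 0; i < 3; i++) trends.recordView("A", Arrays.asList("romcom", "comedy"));
        for (int i = 0; i < 2; i++) trends.recordView("B", Arrays.asList("comedy"));
        trends.recordView("C", Arrays.asList("romcom"));
        System.out.println(trends.master.top(2));                // [A, B]
        System.out.println(trends.byGenre.get("romcom").top(2)); // [A, C]
    }
}
```

Because each view touches the master counter exactly once, the global ranking needs no merge step and therefore no min/max workaround.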
>> In Hydra (a data processing platform we developed that uses stream-lib), I
>> would simply make a tree with a master root node, and a child node for every
>> category. Then I would specify a topK attachment for each, based on movie
>> title. Probably also would throw in a cardinality estimator. This would
>> likely be able to process millions of events a second and return query
>> lookups in about a millisecond.
>>
>> The queries would look like:
>> root/+%topmovies
>>
>> root/categories/romcom/+%topmovies
>>
>> and so forth. You can find some more info here if that sounds convenient:
>>
>> https://www.addthis.com/blog/2014/02/18/getting-started-with-hydra
>>
>> On Apr 10, 2014 5:20 AM, "Sumanth N" <[email protected]> wrote:
>>>
>>> Thanks, Ian, for the reply. The use case is this: I have a set of movies,
>>> each of which belongs to multiple genres. When someone views a movie, an
>>> update is made in all of its categories (genres), so I can easily find which
>>> movie is trending in which genre. If I combine these with an additive merge,
>>> it ends up counting each view multiple times. For the global count (global
>>> topK), I would simply take the max over all the genres for each movie and
>>> get the top 10. I hope the usage makes sense.
>>> Regarding using topK (StreamSummary), I am still understanding and
>>> validating it.
>>>
>>> Regards,
>>> Sumanth
>>>
>>>
>>> On Wednesday, April 9, 2014 12:11:09 AM UTC+5:30, Ian Barfield wrote:
>>>>
>>>> I am not entirely sure I understand your use case. If you are trying to
>>>> do topK, you will have to keep the keys around anyway. Why wouldn't you
>>>> just use a topK estimator? I also don't understand why you wouldn't want an
>>>> additive merge in this case. Maybe that is related to this topK interaction
>>>> that I am missing?
>>>>
>>>> That aside, I am pretty sure you can get the min/max behavior you want
>>>> by simply querying each CMS and min/maxing the responses. I left most of the
>>>> internals package-private because the class is not neatly organized for
>>>> sub-classing, nor does it have clear contracts for what will or will not
>>>> break custom sub-classes in the future. It is certainly possible to make a
>>>> class in that package and use it in your application though.
>>>>
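The "query each CMS and min/max the responses" approach above can be sketched without touching any sketch internals. In this illustration each estimator is just a function from key to estimated count (with stream-lib that would presumably wrap a point query such as estimateCount); exact HashMap counters stand in for the sketches, and CrossCategoryQuery and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToLongFunction;

/** Combine per-category estimates at query time instead of merging tables. */
class CrossCategoryQuery {

    /** Largest estimate reported by any category. */
    static long maxEstimate(List<ToLongFunction<String>> estimators, String key) {
        long best = 0;
        for (ToLongFunction<String> e : estimators) {
            best = Math.max(best, e.applyAsLong(key));
        }
        return best;
    }

    /** Smallest estimate reported by any category (0 if there are none). */
    static long minEstimate(List<ToLongFunction<String>> estimators, String key) {
        if (estimators.isEmpty()) return 0;
        long best = Long.MAX_VALUE;
        for (ToLongFunction<String> e : estimators) {
            best = Math.min(best, e.applyAsLong(key));
        }
        return best;
    }

    /** What an additive merge would report for the key. */
    static long sumEstimate(List<ToLongFunction<String>> estimators, String key) {
        long total = 0;
        for (ToLongFunction<String> e : estimators) {
            total += e.applyAsLong(key);
        }
        return total;
    }

    public static void main(String[] args) {
        // Assumption: exact maps stand in for per-genre sketches.
        Map<String, Long> romcom = new HashMap<>();
        Map<String, Long> comedy = new HashMap<>();
        romcom.put("A", 3L); // movie A is in both genres and was viewed 3 times
        comedy.put("A", 3L);
        comedy.put("B", 2L); // movie B is only in comedy and was viewed 2 times

        List<ToLongFunction<String>> all = new ArrayList<>();
        all.add(title -> romcom.getOrDefault(title, 0L));
        all.add(title -> comedy.getOrDefault(title, 0L));

        System.out.println(sumEstimate(all, "A")); // 6: each view counted once per genre
        System.out.println(maxEstimate(all, "A")); // 3
        System.out.println(minEstimate(all, "B")); // 0: B is absent from romcom
    }
}
```

The last line hints at why max is the safer choice here: a category the movie does not belong to drags the min down to zero (or, with a real sketch, to whatever collision noise that sketch holds for the key).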
>>>>
>>>> On Mon, Apr 7, 2014 at 10:34 AM, <[email protected]> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> When CMS instances are merged, the table is updated with the counts added
>>>>> from all the estimators.
>>>>> In my use case, I think the merge should instead use either the min or the
>>>>> max of the estimated counts from the different estimators.
>>>>>
>>>>> Use case:
>>>>> There are n categories, and for each category there is a CMS for estimating
>>>>> the topK in that category.
>>>>> Additionally, I need to find the topK for all the items across categories;
>>>>> using merge could result in double counting,
>>>>> so I would like to use min or max instead.
>>>>>
>>>>> I have tried to extend the existing CountMinSketch class to add a new
>>>>> combine function that takes the min / max in the merge() call.
>>>>> Alas, all the required variables are package-private and I couldn't
>>>>> make it work.
>>>>>
>>>>> Do let me know whether it is correct to use different functions in merging.
>>>>>
>>>>> Thanks,
>>>>> Sumanth
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "stream-lib-user" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to [email protected].
>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>>
>

