If you'd like to share a sample of your data, I'd be happy to build a template Hydra job with the tree Ian described.
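In the meantime, here is a rough, self-contained sketch of two ideas from the thread quoted below: a conservative-add style update, and Ian's suggestion of querying each per-category CMS and taking the max of the responses instead of merging. Please treat it as an illustration under assumptions. TinySketch, maxEstimate and the hashing are all made up for this example; it is not stream-lib's CountMinSketch or its ConservativeAdd variant, and it does not touch any package-private internals.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Random;

class CategorySketches {

    /** Hypothetical toy Count-Min Sketch; NOT the stream-lib implementation. */
    static final class TinySketch {
        private final int depth;
        private final int width;
        private final long[][] table;
        private final int[] seeds;

        TinySketch(int depth, int width, int seed) {
            this.depth = depth;
            this.width = width;
            this.table = new long[depth][width];
            this.seeds = new int[depth];
            Random r = new Random(seed);
            for (int i = 0; i < depth; i++) {
                seeds[i] = r.nextInt();
            }
        }

        // Illustrative per-row hash; a real sketch would use a stronger hash family.
        private int bucket(String key, int row) {
            int h = key.hashCode() ^ seeds[row];
            h ^= (h >>> 16);
            h *= 0x85ebca6b;
            h ^= (h >>> 13);
            return Math.floorMod(h, width);
        }

        /** Standard CMS point query: the minimum counter over all rows. */
        long estimate(String key) {
            long min = Long.MAX_VALUE;
            for (int i = 0; i < depth; i++) {
                min = Math.min(min, table[i][bucket(key, i)]);
            }
            return min;
        }

        /** Conservative update: raise each counter only as far as (estimate + count). */
        void add(String key, long count) {
            long target = estimate(key) + count;
            for (int i = 0; i < depth; i++) {
                int b = bucket(key, i);
                table[i][b] = Math.max(table[i][b], target);
            }
        }
    }

    /** Query every sketch separately and keep the largest estimate (no merge). */
    static long maxEstimate(Iterable<TinySketch> sketches, String key) {
        long max = 0;
        for (TinySketch s : sketches) {
            max = Math.max(max, s.estimate(key));
        }
        return max;
    }

    public static void main(String[] args) {
        Map<String, TinySketch> byGenre = new LinkedHashMap<>();
        byGenre.put("crime", new TinySketch(4, 64, 1));
        byGenre.put("thriller", new TinySketch(4, 64, 2));

        // One view of a movie updates every genre it belongs to.
        for (int view = 0; view < 5; view++) {
            byGenre.get("crime").add("Heat", 1);
            byGenre.get("thriller").add("Heat", 1);
        }
        byGenre.get("crime").add("Goodfellas", 3);

        // Global count without double counting: max over the per-genre estimates.
        System.out.println("Heat ~ " + maxEstimate(byGenre.values(), "Heat"));
        System.out.println("Goodfellas ~ " + maxEstimate(byGenre.values(), "Goodfellas"));
    }
}
```

As Ian points out below, this still requires knowing which keys to look up, so a topK structure is probably the better fit for the actual trending list.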
On Thu, Apr 10, 2014 at 9:53 AM, Ian Barfield <[email protected]> wrote:
> While we are talking about CMS though, I'll mention that I suggest using the
> ConservativeAdd variant I created a little while back. It is about an order
> of magnitude more accurate in many cases, in exchange for a few more integer
> comparisons.
>
> I suspect the extra CPU use is insignificant and could be regained in any
> event by using a smaller CMS - which you can afford to do with the increased
> accuracy. The only down side is that the theoretical accuracy is not well
> understood other than "better than standard" and experimentally "much
> better", so it uses the same sizing calculations as the base version by
> default.
>
> On Apr 10, 2014 9:43 AM, "Ian Barfield" <[email protected]> wrote:
>>
>> My point about the keys is that you cannot easily find which movie is
>> trending in which category. You'd have to do a look up of every movie title
>> in order to get their scores.
>>
>> For global counts, unless a movie changes categories randomly... The
>> counts should be the same for that key in every category (+/- estimation
>> accuracy). You can get the same result by just querying a single category
>> that movie belongs to.
>>
>> It sounds like what you are saying is that you want to merge the
>> categories and then do an iterative lookup over a bunch of keys. CMS is more
>> useful in the case where the cardinality of keys is huge, or the number of
>> CMS objects you need is very large. The easiest way to get what you want
>> though may be to keep a 'master' CMS - eg. a fake category that every movie
>> belongs to. TopK is probably more appropriate though.
>>
>> In Hydra (a data processing platform we developed that uses stream-lib), I
>> would simply make a tree with a master root node, and a child node for every
>> category. Then I would specify a topK attachment for each, based on movie
>> title. Probably also would throw in a cardinality estimator. This would
>> likely be able to process millions of events a second and return query
>> lookups in about a millisecond.
>>
>> The queries would look like:
>> root/+%topmovies
>>
>> root/categories/romcom/+%topmovies
>>
>> and so forth. Can find some more info here if that sounds convenient.
>>
>> https://www.addthis.com/blog/2014/02/18/getting-started-with-hydra
>>
>> On Apr 10, 2014 5:20 AM, "Sumanth N" <[email protected]> wrote:
>>>
>>> Thanks Ian for the reply. the use case is , I have a set of movies which
>>> belongs to multiple genres. When some one views a movie, an update is made
>>> in all the categories(genres). So I could easily find which movie is
>>> trending in which genre, If I would want to combine these if I do additive
>>> merge it would end up counting multiple times. For global count (global
>>> topK) I would simply take the max from all the genres for a movie and get
>>> the top10. Hope the usage makes sense.
>>> Regarding using topK (StreamSummary), I am still understanding &
>>> validating.
>>>
>>> Regards,
>>> Sumanth
>>>
>>>
>>> On Wednesday, April 9, 2014 12:11:09 AM UTC+5:30, Ian Barfield wrote:
>>>>
>>>> I am not entirely sure I understand your use case. If you are trying to
>>>> do topK, you will have to keep the keys around anyway. Why wouldn't you
>>>> just use a topK estimator? I also don't understand why you wouldn't want an
>>>> additive merge in this case. Maybe that is related to this topK interaction
>>>> that I am missing?
>>>>
>>>> That aside, I am pretty sure you can get the min/max behavior you want
>>>> by simply querying each CMS and min/maxing the response. I left most of the
>>>> internals package private because it is not neatly organized for
>>>> sub-classing, nor has clear contracts for what will or will not break
>>>> custom sub-classes in the future. It is certainly possible to make a class
>>>> in that package and use it in your application though.
>>>>
>>>>
>>>> On Mon, Apr 7, 2014 at 10:34 AM, <[email protected]> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> When CMS is merged, the table is updated with the counts added from all
>>>>> the estimators.
>>>>> In my use case I think, it could be either min or max function of the
>>>>> estimated counts from different estimators to be used.
>>>>>
>>>>> Use case
>>>>> there are n-categories & for each category there is CMS for estimation
>>>>> of topK in the category;
>>>>> Additionally I need to find topK for all the items across categories;
>>>>> using merge it could result in double counting,
>>>>> instead I would like to use min or max.
>>>>>
>>>>> I have tried to extend the existing CountMinSketch class to add a new
>>>>> combine function which will take min / max in merge() call.
>>>>> Alas, all the required variables are package private and I couldn't
>>>>> make it work.
>>>>>
>>>>> Do let me know if it is correct to use different functions in merging.
>>>>>
>>>>> Thanks,
>>>>> Sumanth
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "stream-lib-user" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to [email protected].
>>>>> For more options, visit https://groups.google.com/d/optout.
