Hi Will,

Thanks for your response. I will send my clarifications in a day or two. Please do look at my detailed explanation and at the datasets and results I have shared; they should make clear what I am trying to do.
Essentially, an event_id is a UUID for an event. A click stream will have an event id, as will an impression stream. As background, when I was at Yahoo in 2014 I described a problem that required approximate distributions to Lee, who explained quantile sketches to me. I have used HLL and theta sketches for quite some time now. What I am looking at (asking about, in this case) is changing the need from distinct counts of user ids to using the sketch as an "approximate set" of event ids. Regarding ad spend, the idea is to use a sketch that supports distributions and approximations, including getting additive sums from the sketch. Happy to discuss further on a call.

Vijay

On Mon, May 2, 2022, 22:10 Will Lauer <wla...@yahooinc.com> wrote:

> OK, this is interesting. I've got some concerns and questions that I've put inline below.
>
> Will
>
> Will Lauer
> Senior Principal Architect, Audience & Advertising Reporting

> On Mon, May 2, 2022 at 9:19 AM vijay rajan <vijay.sankar.ra...@gmail.com> wrote:

>> Thanks Lee. Please find my answers inline below in blue. I think you will find my use case very interesting. My next endeavor would be to build a decision tree with entropy / Gini impurity measures computed from sketches. I am amazed at some of the results I have gotten.

>> On Sun, May 1, 2022 at 12:55 AM leerho <lee...@gmail.com> wrote:

>>> Vijay,
>>>
>>> Sorry about the delay in getting back to you.
>>>
>>> There is some critical information missing from your description, and that is the domain of what you are sketching. I presume that it is User-IDs, otherwise it doesn't make sense. If this is the case, I think the solution can be achieved in a couple of ways. Going forward with this assumption, your raw data would look something like this:
>>>
>>> {UserID, AgeGroup, Gender, Country, Impressions, Clicks, AdSpend}

>> Actually it would be an event_id instead of user_id. The domain for the data is a hypothetical online advertising summary:
>>
>> {EventId, AgeGroup, Gender, Country, Impressions, Clicks, AdSpend}
>>
>> The problem that we faced in the past was that it was impossible to "join" a click to an impression. While the raw impression data would have a unique event_id, the raw click data would also have its own event_id. The dimensions were the same for clicks and impressions. What we did back in the day was aggregate the data for impressions separately, aggregate for clicks separately, and then join.
>>
>> You could assume the same here. Each dimension aggregate tuple ({AgeGroup, Gender, Country} in our case) would have a theta sketch for impressions and another for clicks.

> I don't see what a "theta sketch for impressions and another for clicks" would mean here. Theta sketches are for doing distinct counts, so I could understand a theta sketch for EventId, counting the number of distinct event ids associated with the AgeGroup/Gender/Country combination. But that doesn't seem to make sense for clicks or impressions. If your original data was {EventId, AgeGroup, Gender, Country, Impressions, Clicks, AdSpend}, it implies to me that clicks and impressions would always be 1 (that is, an event is either a single click or a single impression). When you drop the EventId dimension, the natural aggregation for these (as well as for AdSpend) is summation, giving you the total clicks, total impressions (and total AdSpend) for the Age/Gender/Country combination. If you were to generate a theta sketch, it would be nonsensical, as all the impression or click values inserted would be the same and the sketch would always return "1".
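To make that distinction concrete, here is a minimal Java sketch of the per-combination aggregation Will describes, assuming datasketches-java 3.x; the class and method names are hypothetical. The theta sketch counts distinct event ids for one AgeGroup/Gender/Country cell, while impressions, clicks, and ad spend are plain sums that need no sketch at all.

    import org.apache.datasketches.theta.UpdateSketch;

    // Aggregate for a single AgeGroup/Gender/Country cell (illustrative only).
    public class CellAggregate {
        private final UpdateSketch distinctEvents =
            UpdateSketch.builder().setNominalEntries(1 << 14).build();
        private long impressions = 0;
        private long clicks = 0;
        private double adSpend = 0.0;

        // eventId is a UUID string; impression/click are 0 or 1 on each event row.
        public void add(String eventId, long impression, long click, double spend) {
            distinctEvents.update(eventId); // distinct count of event ids
            impressions += impression;      // additive metrics are just summed
            clicks += click;
            adSpend += spend;
        }

        public double distinctEventEstimate() {
            return distinctEvents.getEstimate();
        }
    }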
>>> A proposed solution would be to configure a TupleSketch with 3 fields: (Impressions, Clicks, AdSpend). (This requires a little programming. You would need to create classes that implement the interfaces Summary, SummaryFactory and SummarySetOperations. There are examples of how to do this on the website.)

>> I will look into the Tuple sketch and the SummaryFactory and SummarySetOperations interfaces. I did try reading about them, but I cannot say I understood this completely. What type of sketch would we use for AdSpend? Would it be quantile?

>>> Then you would process your raw data as follows:
>>>
>>> - Partition your raw data into 213 streams defined by the three dimensions:
>>>   - 10 x AgeGroup
>>>   - 3 x Gender
>>>   - 200 x Country
>>> - Each stream would feed a configured TupleSketch as above (so you need only 213 sketches).
>>> - These sketches form a 3D hypercube.
>>> - Each input tuple would feed 3 sketches, one in each dimension.

>> I believe I did something similar. Let me try explaining this with concrete examples.
>>
>> 1. This is the raw data (https://drive.google.com/file/d/1zcwZB9KC7cASmM_8mvbPHPQaIeFJNrVw/view?usp=sharing, also at https://tinyurl.com/3d2r7xjh) for a company in India that provides car drivers for hire. Most columns are categorical variables except the first, which is tripId. The schema looks like this:
>>    1. { tripId, source, estimated_usage_bins, city, zone, zone_demand_popularity, dayType, pickUpHourOfDay, sla, booking_type, tripOutcome }
>>    2. tripId, the first column, is distinct per row; I put it into theta sketches as shown later.
>>    3. tripOutcome, the last column, is a classification variable with values "bad" or "good".
>>    4. The problem statement is to do data analysis of combinations of dimensions for which the unfavourable ratio tripOutcome=bad / (tripOutcome=bad + tripOutcome=good) is high.
>> 2. I was able to populate a theta sketch for each dimension=value pair (for example "city=Mumbai") and store them as my effective data representation (illustrated in the code sketch just after this list).
>>    1. With lg nominal entries of 10 (2^10 nominal entries), the data is https://tinyurl.com/bkvz4xrh
>>    2. With lg nominal entries of 12 (2^12 nominal entries), the data is https://tinyurl.com/f9mzm8ef
>>    3. With lg nominal entries of 14 (2^14 nominal entries), the data is https://tinyurl.com/2p8kwj3m
>>    4. Each line of the data looks like this: dimension=value, Base64-encoded theta sketch formed from the unique tripIds
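As an illustration of step 2 above, here is a minimal Java sketch of building one theta sketch per dimension=value pair keyed by tripId and Base64-encoding it for storage. This assumes datasketches-java 3.x; the class and method names are hypothetical and the row parsing is omitted.

    import java.util.Base64;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.datasketches.theta.UpdateSketch;

    public class DimensionValueSketches {
        private final Map<String, UpdateSketch> sketches = new HashMap<>();

        // Called once per (row, column), e.g. addRow("trip-123", "city=Mumbai").
        public void addRow(String tripId, String dimensionEqualsValue) {
            sketches.computeIfAbsent(dimensionEqualsValue,
                    k -> UpdateSketch.builder().setNominalEntries(1 << 14).build())
                .update(tripId);
        }

        // Produces a "dimension=value,<Base64 of compact sketch>" line, as in the shared files.
        public String encode(String dimensionEqualsValue) {
            byte[] bytes = sketches.get(dimensionEqualsValue).compact().toByteArray();
            return dimensionEqualsValue + "," + Base64.getEncoder().encodeToString(bytes);
        }
    }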
> I'm not sure what you are sketching here. What is the entry being inserted into the sketch? Basically, what are you counting? Normally, I would think of sketching as taking your original data (tripId, source, estimated_usage_bins, city, zone, zone_demand_popularity, dayType, pickUpHourOfDay, sla, booking_type, tripOutcome), picking a column (or columns) that you want to count the distinct values of, and generating a new table that contains the dimensions you care about plus that new sketch. In this case, if you are counting trips, the resulting table would look like {source, estimated_usage_bins, city, zone, zone_demand_popularity, dayType, pickUpHourOfDay, sla, booking_type, tripOutcome, sketch(tripId)}. You could then report on a variety of combinations by dropping dimensions and aggregating (merging) the sketches.

>> 3. Next I run an Apriori-like frequent itemset algorithm (ECLAT, https://towardsdatascience.com/the-eclat-algorithm-8ae3276d2d17) to generate combinations of "dimension=value" terms with a minimum support threshold. The input to this implementation is the files above at point 2, e.g. https://tinyurl.com/2p8kwj3m
>> 4. I produce frequent itemsets for both sides, i.e. tripOutcome=bad and tripOutcome=good, and then join the results (a code sketch follows the sample results below). The answers I get are listed here:
>>    1. With dimension=value theta sketches of 2^10 nominal entries, the answers are at https://tinyurl.com/yc3a3rde
>>    2. With dimension=value theta sketches of 2^12 nominal entries, the answers are at https://tinyurl.com/5n7jjr32
>>    3. With dimension=value theta sketches of 2^14 nominal entries, the answers are at https://tinyurl.com/3wrxp4a7
>> 5. The results show that sketches with 2^14 (16K) nominal entries give immaculate results; there are errors (RMSE) in the results from the smaller sketches.
>> 6. Sample of results (columns: level of combination, rule for combination, bad count, good count, total count, percentage of bad count):
>>
>>    4,booking_type=one_way_trip & city=Pune & sla=Scheduled & source=android,26.0,9.0,35.0,74.0
>>    3,booking_type=one_way_trip & city=Pune & source=android,26.0,10.0,36.0,72.0
>>    5,city=Delhi NCR & pickUpHourOfDay=VERYLATE & sla=Scheduled & source=ios & zone_demand_popularity=POPULARITY_INDEX_4,17.0,9.0,26.0,65.0
>>    4,city=Delhi NCR & pickUpHourOfDay=VERYLATE & source=android & zone_demand_popularity=POPULARITY_INDEX_5,15.0,9.0,24.0,63.0
>>    5,booking_type=one_way_trip & city=Delhi NCR & pickUpHourOfDay=VERYLATE & source=ios & zone_demand_popularity=POPULARITY_INDEX_4,16.0,10.0,26.0,62.0
>>    3,booking_type=one_way_trip & sla=Scheduled & zone=205,14.0,9.0,23.0,61.0
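For illustration, here is a minimal Java sketch of how one rule's counts could be estimated from the per-dimension=value sketches by intersection, producing the "percentage of bad count" shown above. It assumes a recent datasketches-java (3.x, where Intersection.intersect(Sketch) is available); the sketch lookup map and helper names are hypothetical.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import org.apache.datasketches.theta.Intersection;
    import org.apache.datasketches.theta.SetOperation;
    import org.apache.datasketches.theta.Sketch;

    public class RuleScorer {
        // Estimated distinct tripId count for a conjunction of "dimension=value" terms.
        static double conjunctionEstimate(Map<String, Sketch> sketches, List<String> terms) {
            Intersection intersection = SetOperation.builder().buildIntersection();
            for (String term : terms) {
                intersection.intersect(sketches.get(term)); // e.g. "city=Pune", "source=android"
            }
            return intersection.getResult().getEstimate();
        }

        // Percentage of bad trips for a rule, as in the sample results above.
        static double percentBad(Map<String, Sketch> sketches, List<String> terms) {
            double bad = conjunctionEstimate(sketches, with(terms, "tripOutcome=bad"));
            double good = conjunctionEstimate(sketches, with(terms, "tripOutcome=good"));
            return 100.0 * bad / (bad + good);
        }

        private static List<String> with(List<String> terms, String extra) {
            List<String> all = new ArrayList<>(terms);
            all.add(extra);
            return all;
        }
    }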
>> I am aware from one of Lee's recent YouTube videos that theta sketches behave like exact sets (bit sets) when they are large enough and the data is small. My use case is fine with a count of 100 being reported as 80 or 120 (20%), but not much higher. I could use the Jaccard formula explained by Lee to get the margin of error.
>>
>> Sorry for the details, but this should help you judge whether what I am doing is indeed what tuple sketches do on their own. I am using update sketches and merging by conjunction. Code is here: https://tinyurl.com/4pzfsj6v

>>> A query would choose the desired coordinates of the 3 dimensions. If the query selects more than one coordinate from any one dimension, then you would need to first merge the corresponding coordinate sketches of that dimension together. A missing dimension would mean first merging all coordinate sketches of that dimension together. The final result sketch is an intersection of the resulting three per-dimension sketches.

>> Yes. The code here should serve as an example: https://tinyurl.com/4pzfsj6v

>>> With the resulting Tuple sketch you iterate over all the retained items (Summaries) and sum each of the 3 fields. Then you divide each sum by theta. The result is the estimate of Impressions, Clicks, and AdSpend for the population of users selected by the query. The size of that population would be obtained by getEstimate().

>> My question remains. Which base sketch type should I use for an additive metric like ad spend? Would that be a *quantile sketch*?

> I still think this question doesn't make sense. What are you trying to understand with the sketch? Are you trying to find out the distribution of ad spend for a particular set of dimension values? Then that sounds like a quantile sketch. Are you trying to find out how many different values of ad spend there were? That's a distinct counting sketch. Are you trying to find out the most common ad spend values? That would be a frequent items sketch.
>
> But really, none of those actually seem to match what you described above. What you described sounded like you were simply trying to calculate the total AdSpend, which just requires a sum, not a sketch.

>>> Be aware that intersections, by their nature, can increase error significantly (see the website). Try to avoid doing intersections with sketch coordinates that have very small populations, if possible.
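For reference, here is a minimal Java sketch of the Tuple-sketch approach Lee describes above, using the ArrayOfDoubles variant from datasketches-java (assumed 3.x) so that no custom Summary classes are needed; the event values are hypothetical. Each key is an event id carrying the three values [impressions, clicks, adSpend]; totals are estimated by summing the retained summaries and dividing by theta, and getEstimate() gives the distinct-event count. In this scheme ad spend does not need its own sketch type; it is just an additive field carried alongside the distinct count.

    import org.apache.datasketches.tuple.arrayofdoubles.ArrayOfDoublesSketchIterator;
    import org.apache.datasketches.tuple.arrayofdoubles.ArrayOfDoublesUpdatableSketch;
    import org.apache.datasketches.tuple.arrayofdoubles.ArrayOfDoublesUpdatableSketchBuilder;

    public class TupleSketchTotals {
        public static void main(String[] args) {
            // One sketch per {AgeGroup, Gender, Country} coordinate;
            // each key carries 3 values: [impressions, clicks, adSpend].
            ArrayOfDoublesUpdatableSketch sketch = new ArrayOfDoublesUpdatableSketchBuilder()
                .setNumberOfValues(3)
                .setNominalEntries(1 << 14)
                .build();

            // Hypothetical events; values for a repeated key are added together.
            sketch.update("event-uuid-1", new double[] {1, 0, 0.25});
            sketch.update("event-uuid-2", new double[] {1, 0, 0.10});
            sketch.update("event-uuid-3", new double[] {0, 1, 0.00});

            double distinctEvents = sketch.getEstimate(); // distinct event ids in this cell

            // Sum the retained summaries, then scale by 1/theta to estimate stream totals.
            double[] totals = new double[3];
            ArrayOfDoublesSketchIterator it = sketch.iterator();
            while (it.next()) {
                double[] values = it.getValues();
                for (int i = 0; i < totals.length; i++) {
                    totals[i] += values[i];
                }
            }
            double theta = sketch.getTheta();
            System.out.printf("events=%.0f impressions=%.0f clicks=%.0f adSpend=%.2f%n",
                distinctEvents, totals[0] / theta, totals[1] / theta, totals[2] / theta);
        }
    }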
>> Yes. However, if I use a very high number of nominal entries, like 2^20, shouldn't this be mitigated?

> The point of using a sketch is estimation. If you can't tolerate any error, you shouldn't be using sketches in the first place. Don't just assume that you SHOULD use high nominal entries. That said, when using theta sketches and doing intersections, we do sometimes use values like 2^20 for nominal entries, but often it is overkill. Remember, in the case of intersections, you get high error when the intersection size is small compared with the overall set sizes. The cases where you see high relative error are the ones where you are looking at small intersections. In most use cases, a relative error of 1,000% or even 10,000% on an intersection size of 10 when comparing sets of a billion is still a reasonable error, given how small the _actual_ value is.

>>> You can get a sense of how good your results are by utilizing getUpperBound() and getLowerBound() (with respect to the user population). The bigger that spread is, the worse your estimates will be.

>> Thanks. This is perfect. I do not need to use the Jaccard to get my estimate.
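To make that bounds check concrete, here is a minimal Java sketch of inspecting the two-standard-deviation bounds on an intersection result. It assumes a recent datasketches-java (3.x); the two input sketches are hypothetical event-id sketches such as "Gender=F" and "Country=IN".

    import org.apache.datasketches.theta.CompactSketch;
    import org.apache.datasketches.theta.Intersection;
    import org.apache.datasketches.theta.SetOperation;
    import org.apache.datasketches.theta.Sketch;

    public class IntersectionBounds {
        // Prints the estimate and its bounds for the intersection of two sketches.
        static void report(Sketch genderF, Sketch countryIn) {
            Intersection intersection = SetOperation.builder().buildIntersection();
            intersection.intersect(genderF);
            intersection.intersect(countryIn);
            CompactSketch result = intersection.getResult();

            double estimate = result.getEstimate();
            double lower = result.getLowerBound(2); // bounds at 2 standard deviations
            double upper = result.getUpperBound(2);
            // A wide spread relative to the estimate means the intersection is unreliable.
            System.out.printf("estimate=%.1f bounds=[%.1f, %.1f]%n", estimate, lower, upper);
        }
    }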
>>> The other option would be to do the cross-product upfront and create 6000 Tuple sketches. In this case each input tuple would feed only one sketch, and the query would go to only one sketch.

>> Since I am not using SQL queries and have my own unique use case, my questions are quite different.

>>> This can quickly get unwieldy with more dimensions or high-cardinality dimensions. It is more accurate because it avoids the intersections. It is also possible to do a mix of these two approaches, but you would need to think it through carefully.
>>>
>>> I hope this helps.
>>>
>>> Lee.

>>> On Thu, Apr 28, 2022 at 1:35 AM vijay rajan <vijay.sankar.ra...@gmail.com> wrote:

>>>> Hi,
>>>>
>>>> Just like theta sketches are used for distinct-count metrics like impressions and clicks, is there a sketch (perhaps quantile?) that can be used for metrics like ad_spend? If so, what are the error bounds?
>>>>
>>>> There is a big opportunity that I see in storing very little data in sketches (which I use as a set), which makes retrieval of aggregate/analytic data very fast (although it is approximate).
>>>>
>>>> This question is best explained with an example. Say I have a fact table schema as follows:
>>>>
>>>> 1. *Age_Groups* is a dimension with 10 bucketed distinct values {less than 18, 19-25, 26-35, ...}
>>>> 2. *Gender* is a dimension with 3 distinct values F, M & Unknown
>>>> 3. *Country* is a dimension with 200 possible values
>>>> 4. *Impressions* is a metric which is a long count (perhaps with a theta sketch or HLL)
>>>> 5. *Clicks* is a metric which is a long count (perhaps with a theta sketch or HLL)
>>>> 6. *Ad-Spend* is a metric which is a double that is summed (*perhaps with a quantile sketch??*)
>>>>
>>>> The maximum possible number of entries in this table would be the cross product of the cardinalities of the dimensions, which is 10 (Age_Group) x 3 (Gender) x 200 (Country) = 6000. Now instead of storing 6000 records, I could store only (10 + 3 + 200) * 3 sketches (*one each for impressions, clicks and Ad-Spend*) = 639 sketches and accomplish the group-by queries using the set operations that sketches provide.
>>>>
>>>> For the impression metric: one sketch for Gender=F, another for Gender=M, yet another for Gender=Unknown, and so on for the other dimensions as well. For the click metric: one sketch for Gender=F, another for Gender=M, yet another for Gender=Unknown, and so on for the other dimensions as well. The question is: what sketch would I need for ad-spend?
>>>>
>>>> On a side note, I have used theta sketches in this manner and even implemented ECLAT for frequent itemset computations and association rule mining. An example from another project is below, with theta sketches used for the counts; I do not know how to do the same for a metric like Ad-Spend.
>>>>
>>>> Level  Conjunction            Count
>>>> 2      H10 & MemberGender=F   74
>>>> 2      M15 & MemberGender=F   83
>>>> 2      G31 & MemberGender=F   59
>>>> 2      MemberGender=F & R13   66
>>>>
>>>> *In the example above, H10, M15, etc. are ICD (International Classification of Diseases) codes for specific diseases:* https://www.aapc.com/codes/code-search/
>>>>
>>>> Hope this is a clear representation of the problem.
>>>>
>>>> Regards
>>>> Vijay
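As a closing illustration of the group-by-via-set-operations idea above, here is a minimal Java sketch of querying such a per-dimension=value sketch collection: coordinates selected within one dimension are unioned first, a dropped dimension is simply omitted (or, as Lee describes, all of its coordinate sketches can be merged), and the per-dimension results are intersected. It assumes a recent datasketches-java (3.x, where Union.union(Sketch) and Intersection.intersect(Sketch) are available); the query structure and names are hypothetical.

    import java.util.List;
    import org.apache.datasketches.theta.Intersection;
    import org.apache.datasketches.theta.SetOperation;
    import org.apache.datasketches.theta.Sketch;
    import org.apache.datasketches.theta.Union;

    public class SketchCubeQuery {
        // query holds, per dimension, the selected "dimension=value" sketches;
        // a dimension dropped from the group-by is simply not included.
        static double estimate(List<List<Sketch>> query) {
            Intersection intersection = SetOperation.builder().buildIntersection();
            for (List<Sketch> coordinates : query) {
                Union union = SetOperation.builder().buildUnion();
                for (Sketch coordinate : coordinates) {
                    union.union(coordinate); // merge selected coordinates within one dimension
                }
                intersection.intersect(union.getResult()); // then intersect across dimensions
            }
            return intersection.getResult().getEstimate();
        }
    }

For example, estimate(List.of(List.of(genderF), List.of(countryIn, countryUs))) would estimate the distinct event ids for Gender=F with Country in {IN, US}, with AgeGroup dropped.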