Thanks Will. Please find my reply inline below. But to stay in line with my original question about a sketch for additive metrics: the point is that I can use such a sketch for on-the-fly aggregation by storing one sketch per "dimension=value" pair, without having to go back to the base table to aggregate.
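To make the "one sketch per dimension=value pair" idea concrete, here is a toy Python version of the scheme, with exact Python sets standing in for theta sketches (set union/intersection plays the role of sketch merge/intersect; the data is made up):

```python
from collections import defaultdict

# One "sketch" per dimension=value pair; exact sets stand in for theta
# sketches, so len() plays the role of getEstimate().
sketches = defaultdict(set)

# Toy impression stream: (event_id, age_group, gender, country).
rows = [
    ("e1", "19-25", "F", "UK"),
    ("e2", "19-25", "M", "UK"),
    ("e3", "26-35", "F", "IN"),
]

for event_id, age, gender, country in rows:
    sketches[f"AgeGroup={age}"].add(event_id)
    sketches[f"Gender={gender}"].add(event_id)
    sketches[f"Country={country}"].add(event_id)

# On-the-fly aggregation: distinct impressions for AgeGroup=19-25 AND
# Country=UK is just an intersection, with no scan of the base table.
print(len(sketches["AgeGroup=19-25"] & sketches["Country=UK"]))  # 2
```

With real theta sketches, the sets above become update sketches keyed by the same "dimension=value" strings, and the intersection returns an estimate with error bounds instead of an exact count.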
I would separately like to walk through the experiments I have performed with Lee, you, or whomever you nominate. You may find my learnings very interesting; they are hard to explain in words, so perhaps a call. I am more than happy to share the code and the thought process.

On Mon, May 2, 2022 at 10:10 PM Will Lauer <wla...@yahooinc.com> wrote:

> OK, this is interesting. I've got some concerns and questions that I've put inline below.
>
> Will
>
> Will Lauer
> Senior Principal Architect, Audience & Advertising Reporting
> Data Platforms & Systems Engineering
> M 508 561 6427
> Champaign Office
> 1908 S. First St
> Champaign, IL 61822
>
> On Mon, May 2, 2022 at 9:19 AM vijay rajan <vijay.sankar.ra...@gmail.com> wrote:
>
>> Thanks Lee. Please find my answers inline below in blue. I think you will find my use case very interesting. My next endeavor is to build a decision tree with entropy / Gini impurity measures on top of sketches. I am amazed at some of the results I have gotten; you may find this quite interesting.
>>
>> On Sun, May 1, 2022 at 12:55 AM leerho <lee...@gmail.com> wrote:
>>
>>> Vijay,
>>> Sorry about the delay in getting back to you.
>>>
>>> There is some critical information missing from your description, and that is the domain of what you are sketching. I presume that it is User-IDs; otherwise it doesn't make sense. If this is the case, I think the solution can be achieved in a couple of ways. Going forward with this assumption, your raw data would look something like this:
>>>
>>>> {UserID, AgeGroup, Gender, Country, Impressions, Clicks, AdSpend}
>>>
>> Actually it would be an event_id instead of user_id. The domain for the data is a hypothetical online-advertising summary:
>>
>> {EventId, AgeGroup, Gender, Country, Impressions, Clicks, AdSpend}
>>
>> The problem that we faced in the past was that it was impossible to "join" a click to an impression.
>> While the raw impression data would have a unique event_id, the raw click data would also have its own event_id. The dimensions were the same for clicks and impressions. What we did back in the day was aggregate the impressions separately, aggregate the clicks separately, and then join.
>>
>> You could assume the same here. Each dimension-aggregate tuple ({AgeGroup, Gender, Country} in our case) would have one theta sketch for impressions and another for clicks.
>>
> I don't see what a "theta sketch for impressions and another for clicks" would mean here. Theta sketches are for doing distinct counts, so I could understand a theta sketch for EventId, counting the number of distinct event ids associated with the AgeGroup/Gender/Country combination. But that doesn't seem to make sense for clicks or impressions.

Just as background, back in the day the only way to join aggregates for clicks with aggregates for impressions was via the dimensions. A tuple in the impressions aggregation table with AgeGroup=30-40, Gender=F, Country=UK would be joined with the tuple AgeGroup=30-40, Gender=F, Country=UK in the clicks aggregation table for the same hour. There was no way to find an exact impression and its corresponding click. So the schema for the click stream would be {EventId, AgeGroup, Gender, Country}, and the schema for the impression stream would be identical: {EventId, AgeGroup, Gender, Country}. Note that the EventId of a click has no relationship with the EventId of an impression.

> If your original data was {EventId, AgeGroup, Gender, Country, Impressions, Clicks, AdSpend}, it implies to me that clicks and impressions would always be 1 (that is, an event is either a single click or a single impression).
> When you drop the EventId dimension, the natural aggregation for these (as well as for AdSpend) is summation, giving you the total clicks, total impressions (and total AdSpend) for the Age/Gender/Country combination. If you were to generate a theta sketch, it seems like it would be nonsensical, as all impressions or clicks inserted would be the same value and the theta sketch would always return "1".

Let me try to explain it this way: the EventId is a distinct id for an impression or a click; look at it as a UUID of some sort. While building a theta sketch for the impression stream, for each dimension=value combination I use the EventId to update the sketch. Example: one theta sketch for AgeGroup=x1, another theta sketch for AgeGroup=x2, and so on. If you open this file, https://tinyurl.com/2p8kwj3m, it will make sense. It is nothing but rows of "dimension=value, Base64-serialized theta sketch".

>>> A proposed solution would be to configure a TupleSketch with 3 fields: (Impressions, Clicks, AdSpend).
>>> (This requires a little programming. You would need to create classes that implement the interfaces Summary, SummaryFactory and SummarySetOperations. There are examples of how to do this on the website.)
>>>
>> I will look into the Tuple sketch and the SummaryFactory and SummarySetOperations. I did try reading about them, but I cannot say I understood them completely. What type of sketch would we use for AdSpend? Would it be quantile?
>>
>>> Then you would process your raw data as follows:
>>>
>>> - Partition your raw data into 213 streams defined by the three dimensions:
>>>   - 10 X AgeGroup
>>>   - 3 X Gender
>>>   - 200 X Country
>>> - Each stream would feed a configured TupleSketch as above (so you need only 213 sketches).
>>> - These sketches are in the form of a 3D hypercube.
>>> - Each input tuple would feed 3 sketches, one in each dimension.
>>>
>> I believe I did something similar. Let me try explaining this with concrete examples.
>>
>> 1. This is the raw data (please click on the link https://tinyurl.com/3d2r7xjh) for a company in India that provides car drivers for hire. The data has most columns as categorical variables except the first, which is tripId. The schema looks something like this:
>>    1. {*tripId*, source, estimated_usage_bins, city, zone, zone_demand_popularity, dayType, pickUpHourOfDay, sla, booking_type, *tripOutcome*}
>>    2. tripId, the first column, is distinct per row; I put it into theta sketches as shown later.
>>    3. tripOutcome, the last column, is a classification variable which has the values "bad" or "good".
>>    4. The problem statement is to do data analysis of combinations of dimensions for which the unfavourable ratio "tripOutcome=bad / (tripOutcome=bad + tripOutcome=good)" is high.
>> 2. I was able to populate theta sketches for each dimension=value pair (example "city=Mumbai") and store them as my effective data representation.
>>    1. with log-base 10, and hence 2^10 nominal entries, the data is https://tinyurl.com/bkvz4xrh
>>    2.
with log-base 12, and hence 2^12 nominal entries, the data is https://tinyurl.com/f9mzm8ef
>>    3. with log-base 14, and hence 2^14 nominal entries, the data is https://tinyurl.com/2p8kwj3m
>>    4. The data looks like this:
>>       1. dimension=value, Base64-encoded theta sketch formed from unique tripIds
>>
> I'm not sure what you are sketching here. What is the entry being inserted into the sketch?

The distinct event id is inserted into the sketch.

> Basically, what are you counting?

I would like to estimate the distinct number of impressions for any combination, for example "AgeGroup=x1 & Gender=M": how many impressions were served.

> Normally, I would think of sketching as taking your original data (tripId, source, estimated_usage_bins, city, zone, zone_demand_popularity, dayType, pickUpHourOfDay, sla, booking_type, tripOutcome), picking a column (or columns) whose distinct values you want to count, and generating a new table that contains the dimensions you care about plus that new sketch.

I am actually taking it a notch up. I am using the theta sketch as a bit vector that holds an estimate of the number of clicks (or number of impressions), and using it to do data mining and association-rule mining. I would be thrilled to get on a call and explain. Please feel free to schedule a call from 10:00 pm to 12:00 pm IST on any weekday. I am in Bangalore.

> In this case, if you are counting trips, the resulting table would look like {source, estimate_usage_bins, city, zone, zone_demand_popularity, dayType, pickupHourOfDay, sla, booking_type, tripOutcome, sketch(tripId)}.
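Will's table-plus-sketch layout supports exactly the roll-ups I need. A toy Python version of that roll-up (plain dicts and exact sets in place of real sketch objects; the data is invented):

```python
# Table keyed by the full dimension tuple; the value is the "sketch"
# of tripIds (an exact set here, standing in for a theta sketch).
table = {
    ("Pune", "bad"): {"t1", "t2"},
    ("Pune", "good"): {"t3"},
    ("Mumbai", "bad"): {"t4"},
}

# Report by dropping the city dimension: merge (union) the sketches
# that share the surviving key, just as a sketch Union would.
by_outcome = {}
for (city, outcome), sk in table.items():
    by_outcome.setdefault(outcome, set()).update(sk)

print(len(by_outcome["bad"]))   # 3
print(len(by_outcome["good"]))  # 1
```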
> You could then report on a variety of combinations by dropping dimensions and aggregating (merging) the sketches.

The best example of what I am trying to achieve is https://tinyurl.com/2p8kwj3m.

>> 1. Next I run an Apriori-like frequent-itemset algorithm (ECLAT, https://towardsdatascience.com/the-eclat-algorithm-8ae3276d2d17) to generate combinations of "dimension=value" with a minimum support threshold. The input to this implementation is the files above at point 2, e.g. https://tinyurl.com/2p8kwj3m.
>> 2. I produce frequent itemsets for both sides, i.e. tripOutcome=bad and tripOutcome=good, and then join the results. The answers I get are listed below:
>>    1. with input having dimension=value and theta sketches with 2^10 nominal entries, the answers are at https://tinyurl.com/yc3a3rde
>>    2. with input having dimension=value and theta sketches with 2^12 nominal entries, the answers are at https://tinyurl.com/5n7jjr32
>>    3.
with input having dimension=value and theta sketches with 2^14 nominal entries, the answers are at https://tinyurl.com/3wrxp4a7
>> 3. The results show that the 2^14 (16K) nominal-entry sketches give essentially exact results; the results from the smaller sketches contain errors (RMSE).
>> 4. Sample of results:
>>
>>    *level of combination, rule for combination, bad count, good count, total count, percentage of bad count*
>>
>>    4, booking_type=one_way_trip & city=Pune & sla=Scheduled & source=android, 26.0, 9.0, 35.0, 74.0
>>    3, booking_type=one_way_trip & city=Pune & source=android, 26.0, 10.0, 36.0, 72.0
>>    5, city=Delhi NCR & pickUpHourOfDay=VERYLATE & sla=Scheduled & source=ios & zone_demand_popularity=POPULARITY_INDEX_4, 17.0, 9.0, 26.0, 65.0
>>    4, city=Delhi NCR & pickUpHourOfDay=VERYLATE & source=android & zone_demand_popularity=POPULARITY_INDEX_5, 15.0, 9.0, 24.0, 63.0
>>    5, booking_type=one_way_trip & city=Delhi NCR & pickUpHourOfDay=VERYLATE & source=ios & zone_demand_popularity=POPULARITY_INDEX_4, 16.0, 10.0, 26.0, 62.0
>>    3, booking_type=one_way_trip & sla=Scheduled & zone=205, 14.0, 9.0, 23.0, 61.0
>>
>> I am aware from one of Lee's recent YouTube videos that theta sketches behave like sets (bit sets) when they are large enough and the data is small. My use case is fine with a count of 100 being reported as 80 or 120 (20% error), but not much higher. I could use the Jaccard formula explained by Lee to get the margin of error.
>>
>> Sorry for the details, but this should help you judge whether what I am doing is what tuple sketches already do on their own. I am using update sketches, merging via conjunction (intersection).
Code is here: https://tinyurl.com/4pzfsj6v

>>> A query would choose the desired coordinates of the 3 dimensions.
>>> If the query selects more than one coordinate from any one dimension, then you would first merge the corresponding coordinate sketches of that dimension together.
>>> A missing dimension would mean first merging all coordinate sketches of that dimension together.
>>> The final result sketch is an Intersection of the resulting 3 dimensional sketches.
>>>
>> Yes, the code here should serve as an example: https://tinyurl.com/4pzfsj6v
>>
>>> With the resulting Tuple sketch you iterate over all the retained items (Summaries) and sum each of the 3 fields.
>>> Then you divide each sum by Theta. The result is the estimate of Impressions, Clicks, and AdSpend for the population of users selected by the query. The size of that population would be obtained by getEstimate().
>>>
>> My question remains: which base sketch type should I use for an additive metric like Ad-Spend? Would that be a *quantile sketch*?
>>
> I still think this question doesn't make sense. What are you trying to understand with the sketch? Are you trying to find out the distribution of adspend for a particular set of dimension values? Then that sounds like a quantile sketch. Are you trying to find out how many different values of adspend there were? That's a distinct-counting sketch. Are you trying to find out the most common adspend values? That would be a frequent-items sketch.
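As I understand Lee's TupleSketch suggestion, each retained key carries a summary of the additive fields, and the final sums are scaled by theta. A toy Python model of that bookkeeping (a plain dict with no actual sampling; the names and data are mine, not the library's API):

```python
# Each retained key maps to a summary (impressions, clicks, ad_spend);
# updating an existing key sums the fields, like a Summary merge.
sketch = {}

def update(key, imps, clicks, spend):
    i, c, s = sketch.get(key, (0, 0, 0.0))
    sketch[key] = (i + imps, c + clicks, s + spend)

update("user1", 1, 0, 0.10)
update("user2", 1, 1, 0.25)
update("user1", 1, 0, 0.05)

# In exact mode theta == 1.0; once a real sketch starts sampling,
# theta < 1 and dividing by it scales the retained sums back up.
theta = 1.0
total_spend = sum(s for _, _, s in sketch.values()) / theta
print(round(total_spend, 2))  # 0.4
```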
> But really, none of those actually seem to match what you described above. What you described above sounded like you were simply trying to calculate the total AdSpend, which just requires a sum, not a sketch.

In the big-data world, where one has millions of rows of event data, if we could approximate impressions and clicks by keeping a sketch for every dimension=value pair, we would make analytics very fast, though approximate.

>>> Be aware that Intersections, by their nature, can increase error significantly (see the website).
>>> Try to avoid doing intersections with sketch coordinates that have very small populations, if possible.
>>
>> Yes. However, if I use a very high number of nominal entries, like 2^20, shouldn't this be mitigated?
>>
> The point of using a sketch is estimation. If you can't tolerate any error, you shouldn't be using sketches in the first place. Don't just assume that you SHOULD use high nominal entries. That said, when using theta sketches and doing intersections, we do sometimes use values like 2^20 for nominal entries, but often it is overkill. Remember, in the case of intersections, you get high error when the intersection size is small compared with the overall set sizes. The cases where you see high relative error are the cases where you are looking at small intersections. In most use cases, a relative error of 1,000% or even 10,000% on an intersection of size 10, when comparing sets of a billion, is still a reasonable error given how small the _actual_ value is.

It's not a zero-or-one thing. I can tolerate a few frequent itemsets being off by 5%, but not more than that. I have understood the lower-bound and upper-bound mechanism explained by Lee.

>>> You can get a sense of how good your results are by utilizing the getUpperBound() and getLowerBound() (with respect to the user population). The bigger that spread is, the worse your estimates will be.

>> Thanks.
This is perfect. I do not need to use the Jaccard to get my estimate.

>>> The other option would be to do the cross-product upfront and create 6000 Tuple sketches.
>>> In this case the input tuple would feed only one sketch, and the query would go to only one sketch.
>>
>> Since I am not using SQL queries and have my own unique use case, my questions are quite different.
>>
>>> This can quickly get unwieldy with more dimensions or high-cardinality dimensions.
>>> It is more accurate because it avoids the intersections. It is also possible to do a mix of these two approaches, but you would need to think it through carefully.
>>>
>>> I hope this helps.
>>>
>>> Lee.
>>>
>>> On Thu, Apr 28, 2022 at 1:35 AM vijay rajan <vijay.sankar.ra...@gmail.com> wrote:
>>>
>>>> Hi,
>>>> Just like theta sketches are used for distinct-count metrics like impressions and clicks, is there a sketch (perhaps quantile?) that can be used for metrics like ad_spend? If so, what are the error bounds?
>>>>
>>>> There is a big opportunity that I see in storing very little data in sketches (which I use as sets), which makes retrieval of aggregate/analytic data very fast (although approximate).
>>>>
>>>> This question is best explained with an example. Say I have a fact table schema as follows:
>>>>
>>>> 1. *Age_Groups* is a dimension with 10 bucketed distinct values {less than 18, 19-25, 26-35, ...}
>>>> 2. *Gender* is a dimension with 3 distinct values F, M & Unknown
>>>> 3. *Country* is a dimension with 200 possible values
>>>> 4. *Impressions* is a metric which is a long count (perhaps with Theta sketch or HLL)
>>>> 5. *Clicks* is a metric which is a long count (perhaps with Theta sketch or HLL)
>>>> 6.
*Ad-Spend* is a metric which is a double and is summed (*perhaps with a quantile sketch??*)
>>>>
>>>> The maximum possible number of entries in this table would be the cross product of the cardinalities of the dimensions, which is 10 (Age_Group) x 3 (Gender) x 200 (Country) = 6000. Now, instead of storing 6000 records, I could store only (10 + 3 + 200) * *3 sketches (one each for impressions, clicks and Ad-Spend)* = 639 sketches and accomplish the group-by queries using the set operations that sketches provide.
>>>>
>>>> For the impression metric: one sketch for Gender=F, another for Gender=M, yet another for Gender=Unknown, and so on for the other dimensions as well.
>>>> For the click metric: one sketch for Gender=F, another for Gender=M, yet another for Gender=Unknown, and so on for the other dimensions as well.
>>>> The question is: what sketch would I need for Ad-Spend?
>>>>
>>>> On a side note, I have used theta sketches in this manner and even implemented ECLAT for frequent-itemset computation and association-rule mining. An example from another project is below, with theta sketches used for the counts; I do not know how to do the same for a metric like Ad-Spend.
>>>>
>>>> Level  Conjunction            Count
>>>> 2      H10 & MemberGender=F   74
>>>> 2      M15 & MemberGender=F   83
>>>> 2      G31 & MemberGender=F   59
>>>> 2      MemberGender=F & R13   66
>>>>
>>>> *In the example above, H10, M15, etc. are ICD codes for specific diseases:* https://www.aapc.com/codes/code-search/
>>>>
>>>> Hope this is a clear representation of the problem.
>>>>
>>>> Regards
>>>> Vijay
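The ECLAT-over-sketches idea in the original question can be sketched in a few lines of Python, with exact id sets standing in for theta sketches: support is the cardinality of the tidset intersection. Only the 2-item level is shown (real ECLAT recurses, intersecting further to grow itemsets), and the data is invented:

```python
from itertools import combinations

# Per-"dimension=value" id sets; with theta sketches, Intersection plus
# getEstimate() would play the role of "&" and len().
tidsets = {
    "MemberGender=F": {1, 2, 3, 4, 5},
    "H10": {1, 2, 3, 6},
    "M15": {2, 3, 4},
}

min_support = 2
for a, b in combinations(sorted(tidsets), 2):
    support = len(tidsets[a] & tidsets[b])
    if support >= min_support:
        # level, conjunction, count -- same shape as the table above
        print(f"2\t{a} & {b}\t{support}")
```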