Thanks Will. Please find my reply in-line below.
But to stay in line with my original question about a sketch for additive
metrics: the idea is that I can use such a sketch for on-the-fly aggregation
by storing one sketch per "dimension=value" pair, without having to go back
to the table to aggregate.

Separately, I would like to walk Lee, you, or whomever you nominate through
the experiments I have performed. You may find my learnings very
interesting.

This is hard to explain in words; perhaps a call would be best. I am more
than happy to share the code and the thought process.



On Mon, May 2, 2022 at 10:10 PM Will Lauer <wla...@yahooinc.com> wrote:

> OK, this is interesting. I've got some concerns and questions that I've
> put inline below.
>
> Will
>
>
>
> Will Lauer
>
>
> Senior Principal Architect, Audience & Advertising Reporting
>
> Data Platforms & Systems Engineering
>
>
> M 508 561 6427
>
> Champaign Office
>
> 1908 S. First St
>
> Champaign, IL 61822
>
>
>
> On Mon, May 2, 2022 at 9:19 AM vijay rajan <vijay.sankar.ra...@gmail.com>
> wrote:
>
>> Thanks Lee. Please find my answers inline below in blue. I think you
>> will find my use case very interesting. My next endeavor is to build a
>> decision tree with entropy / Gini impurity measures computed from
>> sketches. I am amazed at some of the results I have gotten. You may find
>> this quite interesting.
>>
>> On Sun, May 1, 2022 at 12:55 AM leerho <lee...@gmail.com> wrote:
>>
>>> Vijay,
>>> Sorry about the delay in getting back to you.
>>>
>>> There is some critical information missing from your description and
>>> that is the domain of what you are sketching.
>>> I presume that it is User-IDs, otherwise it doesn't make sense.
>>> If this is the case I think the solution can be achieved in a couple of
>>> ways.  Going forward with this assumption:
>>>
>>
>>> Your raw data would look something like this:
>>>
>>>> {UserID, AgeGroup, Gender, Country, Impressions, Clicks, AdSpend}
>>>
>>>
>>>
>> Actually it would be an EventId instead of a UserId. The domain for the
>> data is a hypothetical online advertising summary.
>> {EventId, AgeGroup, Gender, Country, Impressions, Clicks, AdSpend}
>>
>> The problem we faced in the past was that it was impossible to "join" a
>> click to an impression. While the raw impression data would have a unique
>> EventId, the raw click data would also have its own EventId. The
>> dimensions were the same for clicks and impressions. What we did back in
>> the day was aggregate the impressions separately, aggregate the clicks
>> separately, and then join.
>>
>> You could assume the same here. Each dimension aggregate tuple ({AgeGroup,
>> Gender, Country} in our case) would have a theta sketch for impressions
>> and another for clicks.
>>
>
> I don't see what a "theta sketch for impressions and another for clicks"
> would mean here. Theta sketches are for doing distinct counts, so I could
> understand a theta sketch for EventId, counting the number of distinct
> event ids associated with the AgeGroup/Gender/Country combination. But that
> doesn't seem to make sense for clicks or impressions.
>

Just as a background, back in the day the only way to join aggregates for
clicks and aggregates for impressions was via the dimensions. What this
means is that a tuple in impressions aggregation table with AgeGroup=30-40,
Gender=F, Country=UK would be joined with the tuple AgeGroup=30-40,
Gender=F, Country=UK in the clicks aggregation table for the same hour.
There was no way to find the exact impression and its corresponding click.

So the schema for the click stream would be {EventId, AgeGroup, Gender,
Country} while the schema for the impression stream would be identical
{EventId, AgeGroup, Gender, Country}. Note that the EventId of a click has
no relationship to the EventId of an impression.
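To make the dimension-keyed join concrete, here is a minimal sketch in Python (the dimension values and counts are invented for illustration):

```python
# Hourly aggregates keyed by (AgeGroup, Gender, Country); EventIds from the
# two streams are unrelated, so the join happens on the dimension tuple only.
impressions = {("30-40", "F", "UK"): 1200, ("30-40", "M", "UK"): 950}
clicks      = {("30-40", "F", "UK"): 48,   ("30-40", "M", "UK"): 31}

# Join the two aggregation tables on their shared dimension key.
joined = {dims: (imp, clicks.get(dims, 0)) for dims, imp in impressions.items()}

print(joined[("30-40", "F", "UK")])  # (1200, 48)
```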


> If your original data was {EventId, AgeGroup, Gender, Country,
> Impressions, Clicks, AdSpend}, it implies to me that clicks and
> impressions would always be 1 (that is, an event is either a single click
> or single impression). When you drop the EventId dimension, the natural
> aggregation for these (as well as for AdSpend) is summation, giving you the
> total clicks, total impressions (and total AdSpend) for the
> Age/Gender/Country combination. If you were to generate a Theta sketch, it
> seems like it would be nonsensical, as all impressions or clicks inserted
> would be the same value and the theta sketch would always return
> "1".
>
Let me try to explain it this way: the EventId is a distinct id for an
impression or a click; look at it as a UUID of some sort. While building a
theta sketch for the impression stream, for each dimension=value
combination I use the EventId to update the sketch. Example: one Theta
sketch for AgeGroup=x1, another theta sketch for AgeGroup=x2, and so on. If
you open this file, https://tinyurl.com/2p8kwj3m, it will make sense. Each
line is nothing but "dimension=value,Base64 serialized Theta sketch".
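As an illustration of that layout, here is a toy k-minimum-values (KMV) sketch standing in for a real Theta sketch. The class and serialization below are my own inventions for the example; in practice the DataSketches Theta sketch and its byte-array serialization would be used.

```python
import base64
import hashlib
import pickle

def h64(x: str) -> float:
    # Hash an id to a uniform value in [0, 1).
    return int.from_bytes(hashlib.sha256(x.encode()).digest()[:8], "big") / 2**64

class ToyKMV:
    """Toy k-minimum-values sketch; an illustrative stand-in for a Theta sketch."""
    def __init__(self, k: int = 4096):
        self.k, self.vals = k, set()

    def update(self, item: str) -> None:
        self.vals.add(h64(item))
        if len(self.vals) > self.k:
            self.vals.remove(max(self.vals))  # keep only the k smallest hashes

    def estimate(self) -> float:
        if len(self.vals) < self.k:
            return float(len(self.vals))      # exact mode
        return (self.k - 1) / max(self.vals)  # estimation mode

# One sketch per "dimension=value"; each row's EventId updates the sketch of
# every dimension=value pair appearing on that row.
events = [("e1", "30-40", "F"), ("e2", "30-40", "M"), ("e3", "19-25", "F")]
sketches = {}
for event_id, age, gender in events:
    for key in (f"AgeGroup={age}", f"Gender={gender}"):
        sketches.setdefault(key, ToyKMV()).update(event_id)

# "dimension=value,Base64 serialized sketch" lines, as in the linked file.
lines = [f"{key},{base64.b64encode(pickle.dumps(sk.vals)).decode()}"
         for key, sk in sketches.items()]

print(sketches["AgeGroup=30-40"].estimate())  # 2.0
```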



>>> A proposed solution would be to configure a TupleSketch with 3 fields:
>>> (Impressions, Clicks, AdSpend).
>>> (This requires a little programming.  You would need to create classes
>>> that implement the interfaces Summary, SummaryFactory and
>>> SummarySetOperations.  There are examples of how to do this on the
>>> website.)
>>>
>>>
>> I will look into the Tuple sketch and the SummaryFactory and
>> SummarySetOperations interfaces. I did try reading about them, but I
>> cannot say I understood them completely. What type of sketch would we use
>> for AdSpend? Would it be a Quantile sketch?
>>
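For what it's worth, my reading of the proposal is that AdSpend would not get its own sketch type; it becomes a summed field in the Summary, and every estimate is a field sum divided by theta. A toy Python stand-in (the real thing is the Java Tuple sketch with custom Summary classes; all names below are illustrative):

```python
import hashlib

def h64(x: str) -> float:
    # Hash a key to a uniform value in [0, 1).
    return int.from_bytes(hashlib.sha256(x.encode()).digest()[:8], "big") / 2**64

class ToyTupleSketch:
    """Toy analogue of a Tuple sketch whose summary is (impressions, clicks, ad_spend)."""
    def __init__(self, k: int = 4096):
        self.k = k
        self.rows = {}  # retained hash -> [impressions, clicks, ad_spend]

    def update(self, key: str, impressions: int, clicks: int, ad_spend: float) -> None:
        s = self.rows.setdefault(h64(key), [0, 0, 0.0])
        s[0] += impressions
        s[1] += clicks
        s[2] += ad_spend
        if len(self.rows) > self.k:
            del self.rows[max(self.rows)]  # evict the largest hash

    def theta(self) -> float:
        return 1.0 if len(self.rows) < self.k else max(self.rows)

    def estimates(self):
        # Sum each summary field over the retained rows, then divide by theta.
        th = self.theta()
        return tuple(sum(s[i] for s in self.rows.values()) / th for i in range(3))

sk = ToyTupleSketch()
sk.update("e1", 1, 0, 0.25)
sk.update("e2", 1, 1, 0.50)
print(sk.estimates())  # (2.0, 1.0, 0.75)
```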
>
>>
>>
>>> Then you would process your raw data as follows:
>>>
>>>    - Partition your raw data into 213 streams defined by the three
>>>    dimensions:
>>>       - 10 X AgeGroup
>>>       - 3 X Gender
>>>       - 200 X Country
>>>    - Each stream would feed a configured TupleSketch as above. (so you
>>>    need only 213 sketches).
>>>    - These sketches are in the form of a 3D hypercube.
>>>    - Each input tuple would feed 3 sketches, one in each dimension.
>>>
>>>
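If I understand the 213-sketch layout correctly, it can be sketched like this, with plain Python sets standing in for the Tuple sketches and invented data:

```python
# Each raw row feeds exactly three sketches: one coordinate per dimension.
# With 10 AgeGroups + 3 Genders + 200 Countries there are at most 213 sketches.
rows = [
    ("u1", "19-25", "F", "IN"),
    ("u2", "19-25", "M", "IN"),
    ("u3", "26-35", "F", "UK"),
]
cube = {}  # coordinate name -> set of ids (stand-in for a sketch)
for user_id, age, gender, country in rows:
    for coord in (f"AgeGroup={age}", f"Gender={gender}", f"Country={country}"):
        cube.setdefault(coord, set()).add(user_id)

print(len(cube["AgeGroup=19-25"]))  # 2
print(len(cube["Country=IN"]))      # 2
```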
>> I believe I did something similar. Let me try explaining this with
>> concrete examples.
>>
>>    1. This is the raw data
>>    https://drive.google.com/file/d/1zcwZB9KC7cASmM_8mvbPHPQaIeFJNrVw/view?usp=sharing
>>    (short link: https://tinyurl.com/3d2r7xjh) for a company in India that
>>    provides car drivers for hire. The data has most columns as categorical
>>    variables except the first, which is tripId. The schema looks something
>>    like this:
>>       1. { *tripId*,source,estimated_usage_bins,city,zone,zone_demand_popularity,dayType,pickUpHourOfDay,sla,booking_type,
>>       *tripOutcome* }
>>       2. tripId, the first column, is distinct per row; I put it into
>>       theta sketches as shown later.
>>       3. tripOutcome, the last column, is a classification variable with
>>       values "bad" or "good".
>>       4. The problem statement is to do data analysis of combinations of
>>       dimensions that produce an unfavourable result, i.e. for which
>>       "tripOutcome=bad/(tripOutcome=bad + tripOutcome=good)" is high.
>>    2. I was able to populate theta sketches for each dimension and value
>>    pair (example "city=Mumbai") and store them as my effective data
>>    representation.
>>       1. with lg(nominal entries) = 10, i.e. 2^10 nominal entries, the
>>       data is https://tinyurl.com/bkvz4xrh
>>       2. with lg(nominal entries) = 12, i.e. 2^12 nominal entries, the
>>       data is https://tinyurl.com/f9mzm8ef
>>       3. with lg(nominal entries) = 14, i.e. 2^14 nominal entries, the
>>       data is https://tinyurl.com/2p8kwj3m
>>       4. The data looks like this:
>>          1. dimension=value, Base64 encoded Theta sketch formed from
>>          unique tripIds
>>
> I'm not sure what you are sketching here. What is the entry being
> inserted into the sketch?
>
The distinct EventId is inserted into the sketch.


> Basically, what are you counting?
>
I would like to estimate the distinct number of impressions for any
combination, for example "AgeGroup=x1 & Gender=M": how many impressions
were served to that combination.
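In other words, the conjunction estimate comes from intersecting the two per-coordinate sketches. A minimal illustration with hashed id sets (both "sketches" here are small enough to be in exact mode, so theta = 1.0; the ids are invented):

```python
import hashlib

def h64(x: str) -> float:
    # Hash an id to a uniform value in [0, 1).
    return int.from_bytes(hashlib.sha256(x.encode()).digest()[:8], "big") / 2**64

# Per-coordinate sketches built from impression EventIds (exact mode).
age_x1   = {h64(e) for e in ("e1", "e2", "e3")}  # impressions with AgeGroup=x1
gender_m = {h64(e) for e in ("e2", "e3", "e4")}  # impressions with Gender=M

theta = 1.0  # min of the two thetas; both sketches are exact here
estimate = len(age_x1 & gender_m) / theta

print(estimate)  # 2.0
```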


> Normally, I would think of sketching as taking your original data (
> tripId,source,estimated_usage_bins,city,zone,zone_demand_popularity,dayType,pickUpHourOfDay,sla,booking_type,
> tripOutcome ) and picking a column (or columns) that you want to count
> the distinct values of and generating a new table that contains the
> dimensions you care about plus that new sketch.
>

I am actually taking it a notch up. I am using the theta sketch as a bit
vector that holds an estimate of the number of clicks (or impressions), and
using it for data mining and association rule mining. I would be thrilled
to get on a call and explain. Please feel free to schedule a call from
10:00 pm to 12:00 pm IST on any weekday. I am in Bangalore.


> In this case, if you are counting trips, the resulting table would look
> like {source, estimate_usage_bins, city, zone, zone_demand_popularity,
> dayType, pickupHourOfDay, sla, booking_type, tripOutcome, sketch(tripId)}.
> You could then report on a variety of combinations by dropping dimensions
> and aggregating (merging) the sketches.
>

The best example of what I am trying to achieve is
https://tinyurl.com/2p8kwj3m.


>
>>    1. Next I run an Apriori-like Frequent Itemset algorithm [ECLAT,
>>    https://towardsdatascience.com/the-eclat-algorithm-8ae3276d2d17]
>>    to generate combinations of "dimension=value" with a minimum support
>>    threshold. The input to this implementation is the files above at
>>    point 2, e.g. https://tinyurl.com/2p8kwj3m
>>    2. I produced frequent itemsets for both sides, i.e. tripOutcome=bad
>>    and tripOutcome=good, and then joined the results. The answers I get
>>    are listed below:
>>       1. with input having dimension=value and theta sketches with 2^10
>>       nominal entries, answers can be found here:
>>       https://tinyurl.com/yc3a3rde
>>       2. with input having dimension=value and theta sketches with 2^12
>>       nominal entries, answers can be found here:
>>       https://tinyurl.com/5n7jjr32
>>       3. with input having dimension=value and theta sketches with 2^14
>>       nominal entries, answers can be found here:
>>       https://tinyurl.com/3wrxp4a7
>>    3. The results show that sketches with 2^14 (16K) nominal entries give
>>    essentially exact results. The results from the smaller sketches have
>>    errors (measured as RMSE).
>>    4. Sample of results
>>       1.
>>
>>       *level of combination, rule for combination, bad count, good
>>       count, total count, percentage of bad count*
>>       2.
>>
>>       4,booking_type=one_way_trip & city=Pune & sla=Scheduled &
>>       source=android,26.0,9.0,35.0,74.0
>>
>>       3,booking_type=one_way_trip & city=Pune &
>>       source=android,26.0,10.0,36.0,72.0
>>
>>       5,city=Delhi NCR & pickUpHourOfDay=VERYLATE & sla=Scheduled &
>>       source=ios & 
>> zone_demand_popularity=POPULARITY_INDEX_4,17.0,9.0,26.0,65.0
>>
>>       4,city=Delhi NCR & pickUpHourOfDay=VERYLATE & source=android &
>>       zone_demand_popularity=POPULARITY_INDEX_5,15.0,9.0,24.0,63.0
>>
>>       5,booking_type=one_way_trip & city=Delhi NCR &
>>       pickUpHourOfDay=VERYLATE & source=ios &
>>       zone_demand_popularity=POPULARITY_INDEX_4,16.0,10.0,26.0,62.0
>>
>>       3,booking_type=one_way_trip & sla=Scheduled &
>>       zone=205,14.0,9.0,23.0,61.0
>>
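The ECLAT step can be illustrated like this, with id sets standing in for the Theta sketches; in a real run the set intersection below would be a sketch Intersection followed by getEstimate(), and the data here is invented:

```python
from itertools import combinations

# Per "dimension=value" trip-id sets (stand-ins for the Theta sketches).
tidlists = {
    "city=Pune":      {"t1", "t2", "t3", "t4"},
    "source=android": {"t1", "t2", "t5"},
    "sla=Scheduled":  {"t2", "t3", "t5"},
}
min_support = 2

# Level-2 ECLAT: intersect the single-item id sets pairwise and keep
# combinations whose (estimated) support clears the threshold.
frequent2 = {}
for (a, sa), (b, sb) in combinations(tidlists.items(), 2):
    support = len(sa & sb)  # sketch intersection + getEstimate() in practice
    if support >= min_support:
        frequent2[f"{a} & {b}"] = support

print(frequent2["city=Pune & source=android"])  # 2
```

Higher levels extend frequent combinations one item at a time, intersecting against the new item's sketch, which is why per-value sketches are all the algorithm needs.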
>> I am aware from one of Lee's recent YouTube videos that theta sketches
>> behave like exact sets (bit sets) when the sketch is large enough and the
>> data is small. My use case is fine with a count of 100 being reported as
>> 80 or 120 (20%), but not much worse. I could use the Jaccard similarity
>> explained by Lee to get the margin of error.
>>
>>
>> Sorry for the details, but this should help you judge whether what I am
>> doing is indeed what tuple sketches do on their own. I am using update
>> sketches and merging via conjunction (intersection). Code is here:
>> https://tinyurl.com/4pzfsj6v
>>
>>
>>
>>
>>> A query would choose the desired coordinates of the 3 dimensions.
>>> If the query selects more than one coordinate from any one dimension,
>>> then you would need to first merge the corresponding coordinate sketches
>>> of that dimension together.
>>> A missing dimension would mean first merging all coordinate sketches of
>>> that dimension together.
>>> The final result sketch is an Intersection of the resulting 3
>>> dimensional sketches.
>>>
>>
>> Yes, the code here should serve as an example:
>> https://tinyurl.com/4pzfsj6v
>>
>>
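Lee's query recipe (union the selected coordinates within a dimension, then intersect across dimensions) can be illustrated as follows, with id sets standing in for the per-coordinate sketches and invented data:

```python
# Query: (Gender = F OR Gender = M) AND Country = UK.
gender  = {"F": {"e1", "e2"}, "M": {"e3"}, "Unknown": {"e4"}}
country = {"UK": {"e1", "e3", "e4"}, "IN": {"e2"}}

selected = gender["F"] | gender["M"]   # union of coordinates within a dimension
result   = selected & country["UK"]    # intersection across dimensions

print(len(result))  # 2 (e1 and e3)
```

A missing dimension is handled the same way: union all of its coordinate sets first, which makes that dimension a no-op in the intersection.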
>>
>>>
>>> With the resulting Tuple sketch you iterate over all the retained items
>>> (Summaries) and sum each of the 3 fields.
>>> Then you divide each sum by Theta.  The result is the estimate of
>>> Impressions, Clicks, and AdSpend
>>> for the population of users selected by the query.  The size of that
>>> population would be obtained by getEstimate().
>>>
>>>
>> My question remains. Which base sketch type should I use for an additive
>> metric like Ad-spend? Would that be a *quantile sketch*?
>>
>
> I still think this question doesn't make sense. What are you trying to
> understand with the sketch? Are you trying to find out the distribution of
> adspend for a particular set of dimension values? Then that sounds like a
> quantile sketch. Are you trying to find out how many different values of
> adspend there were? That's a distinct counting sketch. Are you trying to
> find out the most common ad spend values? That would be a frequent items
> sketch.
>
> But really, none of those actually seem to match what you described above.
> What you described above sounded like you were simply trying to calculate
> the total AdSpend, which just requires sum, not a sketch.
>

In the big data world, where one has millions of rows of event data, if we
could approximate impressions and clicks by keeping a sketch for every
dimension=value pair, we would make analytics very fast, though
approximate.




>
>>
>>
>>> Be aware that Intersections, by their nature, can increase error
>>> significantly. (see the Website).
>>> Try to avoid doing intersections with sketch coordinates that have very
>>> small populations, if possible.
>>>
>>
>> Yes. However if I use very high nominal entries like 2^20 shouldn't this
>> get mitigated?
>>
>
> The point of using a sketch is estimation. If you can't tolerate any
> error, you shouldn't be using sketches in the first place. Don't just
> assume that you SHOULD use high nominal entries. That said, when using
> theta sketches and doing intersections, we do sometimes use values like
> 2^20 for nominal entries, but often it is overkill. Remember, in the case
> of intersections, you get high error when the intersection size is small
> compared with the overall set sizes. The cases where you see high relative
> error are those where you are looking at small intersections. In most
> use cases, a relative error of 1,000% or even 10,000% on an intersection
> size of 10 when comparing sets of a billion is still a reasonable error
> given how small the _actual_ value is.
>

It's not a zero-or-one thing. I can tolerate a few frequent itemsets being
off by 5%, but not more than that. I have understood the LowerBound and
UpperBound mechanism explained by Lee.
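For sizing against a 5% tolerance, a common rule of thumb is that a Theta sketch in estimation mode has a relative standard error of roughly 1/sqrt(k) for k nominal entries; getLowerBound/getUpperBound give the real bounds, so this is only a back-of-the-envelope check:

```python
import math

# Rough relative standard error (RSE) ~ 1/sqrt(k) for k nominal entries.
for lg_k in (10, 12, 14, 20):
    k = 2 ** lg_k
    rse = 1 / math.sqrt(k)
    print(f"2^{lg_k:>2}: ~{rse:6.3%} RSE, ~{2 * rse:6.3%} at 95% confidence")
```

By this estimate, 2^10 gives roughly plus or minus 6% at 95% confidence, so staying within 5% points at 2^12 or larger, which is consistent with the observation earlier in the thread that the 2^14 sketches were essentially exact on this data.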



>
>>
>>
>>> You can get a sense of how good your results are by utilizing the
>>> getUpperBound() and getLowerBound()
>>> (with respect to the user population.).  The bigger that spread is, the
>>> worse your estimates will be.
>>>
>>
>> Thanks. This is perfect. I do not need to use the Jaccard to get my
>> estimate.
>>
>>
>>
>>> The other option would be to do the cross-product upfront and create
>>> 6000 Tuple sketches.
>>> In this case the input tuple would feed only one sketch and the query
>>> would be to only one sketch.
>>>
>>
>> Since I am not using SQL queries and have my own unique use case, my
>> questions are quite different.
>>
>>
>>> This can quickly get unwieldy with more dimensions or high cardinality
>>> dimensions.
>>> It is more accurate because it avoids the intersections.  It is also
>>> possible to do a mix of
>>> these two approaches, but you would need to think it through carefully.
>>>
>>> I hope this helps.
>>>
>>> Lee.
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Thu, Apr 28, 2022 at 1:35 AM vijay rajan <
>>> vijay.sankar.ra...@gmail.com> wrote:
>>>
>>>> Hi,
>>>> Just like theta sketches are used for distinct count metrics like
>>>> impressions and clicks, is there a sketch (perhaps quantile?) that can be
>>>> used for metrics like ad_spend? If so, what are the error bounds?
>>>>
>>>> There is a big opportunity that I see in storing very little data in
>>>> sketches (which I use as sets), which makes retrieval of
>>>> aggregate/analytic data very fast (although approximate).
>>>>
>>>> This question is best explained with an example. Say I have a fact
>>>> table schema as follows
>>>>
>>>>
>>>>    1. *Age_Groups* is a dimension with 10 bucketed distinct values
>>>>    {less than 18, 19-25, 26-35....}
>>>>    2. *Gender* is a dimension with 3 distinct values F, M & Unknown
>>>>    3. *Country* is a  dimension with 200 possible values
>>>>    4. *Impressions* is a metric which is a long count (perhaps with
>>>>    Theta sketch or HLL)
>>>>    5. *Clicks* is a metric which is a long count (perhaps with Theta
>>>>    sketch or HLL)
>>>>    6. *Ad-Spend* is a metric which is a double which is a sum (*perhaps
>>>>    with quantile sketch??*)
>>>>
>>>>
>>>> The maximum possible number of entries in this table would be the cross
>>>> product of the cardinalities of the dimensions, which is 10 (Age_Group) x
>>>> 3 (Gender) x 200 (Country) = 6000. Now, instead of storing 6000 records, I
>>>> could store only (10 + 3 + 200) x *3 sketches* (one each for impressions,
>>>> clicks and Ad-Spend) = 639 sketches and accomplish the group-by queries
>>>> using the set operations that sketches provide.
>>>>
>>>> For impression metric, one sketch for Gender=F, another sketch for
>>>> Gender=M, yet another for Gender=Unknown and so on for other dimensions as
>>>> well.
>>>> For click metric, one sketch for Gender=F, another sketch for Gender=M,
>>>> yet another for Gender=Unknown and so on for other dimensions as well.
>>>> Question is what sketch would I need for ad-spend?
>>>>
>>>> On a side note, I have used Theta sketches in this manner and even
>>>> implemented ECLAT for frequent itemset computations and association rule
>>>> mining ... example from another project below with theta sketches used for
>>>> count, I do not know how to do the same for a metric like Ad-Spend.
>>>>
>>>> Level Conjunction Count
>>>> 2 H10 & MemberGender=F 74
>>>> 2 M15 & MemberGender=F 83
>>>> 2 G31 & MemberGender=F 59
>>>> 2 MemberGender=F & R13 66
>>>> *In the example above, H10, M15, etc. are International Classification of
>>>> Diseases (ICD) codes for specific diseases:*
>>>> https://www.aapc.com/codes/code-search/
>>>>
>>>> Hope this is a clear representation of the problem.
>>>>
>>>> Regards
>>>> Vijay
>>>>
>>>>
>>>>
>>>>
>>>>
