OK, this is interesting. I've got some concerns and questions that I've put
inline below.

Will



Will Lauer


Senior Principal Architect, Audience & Advertising Reporting

Data Platforms & Systems Engineering


M 508 561 6427

Champaign Office

1908 S. First St

Champaign, IL 61822



On Mon, May 2, 2022 at 9:19 AM vijay rajan <vijay.sankar.ra...@gmail.com>
wrote:

> Thanks Lee. Please find my answers inline below in blue. I think you will
> find my use case very interesting. My next endeavor would be to make a
> decision tree with entropy / gini impurity measures with sketches. I am
> amazed at some of the results I have gotten. You may find this quite
> interesting.
>
> On Sun, May 1, 2022 at 12:55 AM leerho <lee...@gmail.com> wrote:
>
>> Vijay,
>> Sorry about the delay in getting back to you.
>>
>> There is some critical information missing from your description and that
>> is the domain of what you are sketching.
>> I presume that it is User-IDs, otherwise it doesn't make sense.
>> If this is the case I think the solution can be achieved in a couple of
>> ways.  Going forward with this assumption:
>>
>
>> Your raw data would look something like this:
>>
>>> {UserID, AgeGroup, Gender, Country, Impressions, Clicks, AdSpend}
>>
>>
>>
> Actually it would be an event_id instead of user_Id. The domain for the
> data is a hypothetical online advertising summary.
> {EventId, AgeGroup, Gender, Country, Impressions, Clicks, AdSpend}
>
> The problem that we faced in the past was that it was impossible to "join"
> a click to an impression. While the raw impression data would have a unique
> event_id the raw click data would also have its own event_id. The
> dimensions were the same for clicks and impressions. What we did back in
> the day was to aggregate the data for impressions separately and then
> aggregate for clicks separately and then join.
>
> You could assume the same here. Each dimension aggregate tuple {AgeGroup,
> Gender and Country in our case} would have a theta sketch for impressions
> and another for clicks.
>

I don't see what a "theta sketch for impressions and another for clicks"
would mean here. Theta sketches are for doing distinct counts, so I could
understand a theta sketch for EventId, counting the number of distinct
event ids associated with the AgeGroup/Gender/Country combination. But that
doesn't seem to make sense for clicks or impressions. If your original data
was {EventId, AgeGroup, Gender, Country, Impressions, Clicks, AdSpend}, it
implies to me that clicks and impressions would always be 1 (that is, an
event is either a single click or single impression). When you drop the
EventId dimension, the natural aggregation for these (as well as for
AdSpend) is summation, giving you the total clicks, total impressions (and
total AdSpend) for the Age/Gender/Country combination. If you were to
generate a theta sketch from these fields, it would be nonsensical: all the
impression or click values inserted would be the same, so the sketch would
always return "1".
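To make that concrete, here is a toy illustration in plain Python (the rows
and values are invented for illustration; exact arithmetic, no sketch
involved, which is exactly the point):

```python
# Toy rows: each event is either a single impression or a single click.
# (event_id, age_group, gender, country, impressions, clicks, ad_spend)
rows = [
    (1, "19-25", "F", "IN", 1, 0, 0.10),
    (2, "19-25", "F", "IN", 0, 1, 0.00),
    (3, "19-25", "F", "IN", 1, 0, 0.25),
]

# When EventId is dropped, the natural aggregation for these metrics is
# summation, not distinct counting.
total_impressions = sum(r[4] for r in rows)   # 2
total_clicks      = sum(r[5] for r in rows)   # 1
total_spend       = sum(r[6] for r in rows)   # 0.35

# A distinct count over the Impressions column itself is useless: every
# nonzero value inserted is 1, so the "sketch" always answers 1.
distinct_impression_values = len({r[4] for r in rows if r[4] != 0})  # 1

# A distinct count over EventId is the thing a theta sketch is for.
distinct_events = len({r[0] for r in rows})   # 3
```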

>
>
>
>> A proposed solution would be to configure a TupleSketch with 3 fields:  
>> (Impressions,
>> Clicks, AdSpend).
>> (This requires a little programming.  You would need to create classes
>> that implement the interfaces Summary,
>> SummaryFactory and SummarySetOperations.  There are examples of how to do
>> this on the website.)
>>
>>
> I will look into the Tuple sketch and the SummaryFactory and
> SummarySetOperations. I did try reading it but I cannot say I understood this
> completely. What type of sketch would we use for AdSpend? Would it be
> Quantile?
>

>
>
>> Then you would process your raw data as follows:
>>
>>    - Partition your raw data into 213 streams defined by the three
>>    dimensions:
>>       - 10 X AgeGroup
>>       - 3 X Gender
>>       - 200 X Country
>>    - Each stream would feed a configured TupleSketch as above. (so you
>>    need only 213 sketches).
>>    - These sketches are in the form of a 3D hypercube.
>>    - Each input tuple would feed 3 sketches, one in each dimension.
>>
>>
> I believe I did something similar. Let me try explaining this with
> concrete examples
>
>    1. This is the raw data
>    (please click on the link https://tinyurl.com/3d2r7xjh
>    ) for a company in India that provides car drivers for hire. The data has
>    most columns as categorical variables except the first which is tripId. The
>    schema looks something like this
>       1. { 
> *tripId*,source,estimated_usage_bins,city,zone,zone_demand_popularity,dayType,pickUpHourOfDay,sla,booking_type,
>       *tripOutcome* }
>       2. tripId the first column, is distinct per row which I put into
>       theta sketches as shown later
>       3. tripOutcome the last column, is a classification variable which
>       has values "bad or good"
>       4. The problem statement is to do data analysis of combinations of
>       dimensions that produce an unfavourable result, i.e. where
>       "tripOutcome=bad/(tripOutcome=bad + tripOutcome=good)" is high
>    2. I was able to populate theta sketches for each dimension and value
>    pair (example "city=Mumbai")  and store them as my effective data
>    representation.
>       1. with 2^10 nominal entries the data is
>       https://tinyurl.com/bkvz4xrh
>
>       2. with 2^12 nominal entries the data is
>       https://tinyurl.com/f9mzm8ef
>
>       3. with 2^14 nominal entries the data is
>       https://tinyurl.com/2p8kwj3m
>       4. The data looks like this
>          1. dimension=value, Base64 encoded Theta sketch formed from
>          unique tripId
>
I'm not sure what you are sketching here. What is the entry being inserted
into the sketch? Basically, what are you counting? Normally, I would think
of sketching as taking your original data (
tripId,source,estimated_usage_bins,city,zone,zone_demand_popularity,dayType,pickUpHourOfDay,sla,booking_type,
tripOutcome ) and picking a column (or columns) that you want to count the
distinct values of and generating a new table that contains the dimensions
you care about plus that new sketch. In this case, if you are counting
trips, the resulting table would look like {source, estimate_usage_bins,
city, zone, zone_demand_popularity, dayType, pickUpHourOfDay, sla,
booking_type, tripOutcome, sketch(tripId)}. You could then report on a
variety of combinations by dropping dimensions and aggregating (merging)
the sketches.
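As a toy sketch of that pipeline (plain Python, with exact sets standing in
for theta sketches, and invented rows that keep only a few of your columns),
the aggregation and the drop-a-dimension merge look like this:

```python
from collections import defaultdict

# Hypothetical raw rows: (trip_id, city, sla, trip_outcome);
# the other dimensions are omitted for brevity.
raw = [
    ("t1", "Pune",  "Scheduled", "bad"),
    ("t2", "Pune",  "Scheduled", "good"),
    ("t3", "Pune",  "OnDemand",  "bad"),
    ("t4", "Delhi", "Scheduled", "bad"),
]

# Aggregate: keep the dimensions, replace tripId with a distinct-count
# "sketch" of tripIds (an exact set here; a theta sketch at scale).
table = defaultdict(set)   # {(city, sla, outcome): sketch(tripId)}
for trip_id, city, sla, outcome in raw:
    table[(city, sla, outcome)].add(trip_id)

# Report on fewer dimensions by dropping a column and merging sketches.
by_city_outcome = defaultdict(set)
for (city, sla, outcome), sk in table.items():
    by_city_outcome[(city, outcome)] |= sk   # set union == sketch merge

print(len(by_city_outcome[("Pune", "bad")]))   # 2 distinct trips
```

The same drop-and-merge step works for any subset of the dimensions, which
is what lets one aggregated table serve many group-by reports.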

>
>    1. Next I run an Apriori-like Frequent Itemset algorithm [ECLAT
>    https://towardsdatascience.com/the-eclat-algorithm-8ae3276d2d17
>    ] to generate a combination of "dimension=value" with a minimum support
>    threshold. The input to this implementation takes the files above at point
>    2. https://tinyurl.com/2p8kwj3m
>    2. I produced frequent itemsets for both sides, i.e. tripOutcome=bad and
>    tripOutcome=good, and then joined the results. The answers I get are listed
>    below
>       1. with input having dimension=value and theta sketches with 2^10
>       nominal entries, the answers can be found here:
>       https://tinyurl.com/yc3a3rde
>
>       2. with input having dimension=value and theta sketches with 2^12
>       nominal entries, the answers can be found here:
>       https://tinyurl.com/5n7jjr32
>       3. with input having dimension=value and theta sketches with 2^14
>       nominal entries, the answers can be found here:
>       https://tinyurl.com/3wrxp4a7
>    3. The results show that sketches with 2^14 (16K) nominal entries give
>    immaculate results. There are errors (RMSE) in the results from the
>    other sketches.
>    4. Sample of results
>       1.
>
>       *level of combination, rule for combination, bad count, good count,
>       total count, percentage of bad count*
>       2.
>
>       4,booking_type=one_way_trip & city=Pune & sla=Scheduled &
>       source=android,26.0,9.0,35.0,74.0
>
>       3,booking_type=one_way_trip & city=Pune &
>       source=android,26.0,10.0,36.0,72.0
>
>       5,city=Delhi NCR & pickUpHourOfDay=VERYLATE & sla=Scheduled &
>       source=ios & 
> zone_demand_popularity=POPULARITY_INDEX_4,17.0,9.0,26.0,65.0
>
>       4,city=Delhi NCR & pickUpHourOfDay=VERYLATE & source=android &
>       zone_demand_popularity=POPULARITY_INDEX_5,15.0,9.0,24.0,63.0
>
>       5,booking_type=one_way_trip & city=Delhi NCR &
>       pickUpHourOfDay=VERYLATE & source=ios &
>       zone_demand_popularity=POPULARITY_INDEX_4,16.0,10.0,26.0,62.0
>
>       3,booking_type=one_way_trip & sla=Scheduled &
>       zone=205,14.0,9.0,23.0,61.0
>
> I am aware from one of Lee's recent YouTube videos that theta sketches
> work like exact sets (bit sets) when the sketch is large enough relative
> to the data. My use case is fine with a count of 100 being reported as 80
> or 120 (20%), but not much higher. I could use the Jaccard formula
> explained by Lee to get the margin of error.
>
>
> Sorry for the details, but this should help you suggest whether what I am
> doing is indeed what tuple sketches do on their own. I am using update
> sketches and merging using conjunction (intersection). Code is here:
> https://tinyurl.com/4pzfsj6v
>
>
>
>
>> A query would choose the desired coordinates of the 3 dimensions.
>> If the query selects more than one coordinate from any one dimension,
>> then you would need to first merge the corresponding coordinate sketches
>> of that dimension together.
>> A missing dimension would mean first merging all coordinate sketches of
>> that dimension together.
>> The final result sketch is an Intersection of the resulting 3 dimensional
>> sketches.
>>
>
> Yes. the code here should serve as an example https://tinyurl.com/4pzfsj6v
>
>
>
>>
>> With the resulting Tuple sketch you iterate over all the retained items
>> (Summaries) and sum each of the 3 fields.
>> Then you divide each sum by Theta.  The result is the estimate of
>> Impressions, Clicks, and AdSpend
>> for the population of users selected by the query.  The size of that
>> population would be obtained by getEstimate().
>>
>>
> My question remains. Which base sketch type should I use for an additive
> metric like Ad-spend? Would that be a *quantile sketch*?
>

I still think this question doesn't make sense. What are you trying to
understand with the sketch? Are you trying to find out the distribution of
adspend for a particular set of dimension values? Then that sounds like a
quantile sketch. Are you trying to find out how many different values of
adspend there were? That's a distinct counting sketch. Are you trying to
find out the most common ad spend values? That would be a frequent items
sketch.

But really, none of those actually seem to match what you described above.
What you described above sounded like you were simply trying to calculate
the total AdSpend, which just requires sum, not a sketch.
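To make the distinction concrete, here are the three questions, plus the
plain sum, computed exactly on an invented AdSpend column (plain Python; the
corresponding sketches answer the same questions approximately at scale):

```python
from collections import Counter
from statistics import quantiles

ad_spend = [0.10, 0.10, 0.25, 0.25, 0.25, 0.50, 1.00]   # invented values

# "What is the distribution of adspend?" -> a quantiles sketch (e.g. KLL).
print(quantiles(ad_spend, n=4))          # quartile cut points

# "How many different adspend values?" -> a distinct-count sketch (Theta/HLL).
print(len(set(ad_spend)))                # 4

# "Which adspend values are most common?" -> a frequent-items sketch.
print(Counter(ad_spend).most_common(1))  # [(0.25, 3)]

# What was actually described above: total adspend. Just a sum, no sketch.
print(sum(ad_spend))                     # 2.45
```

Pick the sketch by the question you want answered; if the question is "what
is the total", no sketch is involved at all.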

>
>
>
>> Be aware that Intersections, by their nature, can increase error
>> significantly. (see the Website).
>> Try to avoid doing intersections with sketch coordinates that have very
>> small populations, if possible.
>>
>
> Yes. However, if I use very high nominal entries like 2^20, shouldn't this
> get mitigated?
>

The point of using a sketch is estimation. If you can't tolerate any error,
you shouldn't be using sketches in the first place, so don't just assume
that you SHOULD use high nominal entries. That said, when using theta
sketches and doing intersections, we do sometimes use values like 2^20 for
nominal entries, but often it is overkill. Remember, in the case of
intersections, you get high error when the intersection size is small
compared with the overall set sizes. The cases where you see high relative
error are the ones where you are looking at small intersections. In most
use cases, a relative error of 1,000% or even 10,000% on an intersection
size of 10 when comparing sets of a billion is still a reasonable error,
given how small the _actual_ value is.
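You can see that effect with a small simulation (plain Python; a simplified
fixed-threshold hash sample stands in for a theta sketch here — a real theta
sketch keeps the k smallest hashes and adapts theta, but the error behavior
on small intersections is the same in spirit):

```python
import hashlib

# Two sets of ~100K items that share exactly 10 members.
a = set(range(0, 100_000))
b = set(range(99_990, 200_000))           # |a & b| == 10

K = 1024                                  # like 2^10 nominal entries
theta = K / 100_000                       # fixed sampling threshold

def kept(x):
    """Deterministic per-item coin flip: the same item is kept (or not)
    in every sketch that sees it, which is what makes intersection work."""
    h = int(hashlib.md5(str(x).encode()).hexdigest()[:8], 16)
    return h / 0xFFFFFFFF < theta

sa = {x for x in a if kept(x)}
sb = {x for x in b if kept(x)}

est_a     = len(sa) / theta               # tight in relative terms (~100,000)
est_inter = len(sa & sb) / theta          # estimate of 10: each retained
                                          # hash contributes ~1/theta = ~98,
                                          # so the estimate is 0 or ~98+
print(est_a, est_inter)
```

The big-set estimate is within a few percent, while the 10-element
intersection is estimated from at most a couple of retained hashes, so its
relative error can be enormous even though the absolute error is tiny.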

>
>
>
>> You can get a sense of how good your results are by utilizing the
>> getUpperBound() and getLowerBound()
>> (with respect to the user population.).  The bigger that spread is, the
>> worse your estimates will be.
>>
> Thanks. This is perfect. I do not need to use the Jaccard to get my
> estimate.
>
>
>
>> The other option would be to do the cross-product upfront and create 6000
>> Tuple sketches.
>> In this case the input tuple would feed only one sketch and the query
>> would be to only one sketch.
>>
>
> Since I am not using SQL queries and have my own unique use case, my
> questions are quite different.
>
>
>> This can quickly get unwieldy with more dimensions or high cardinality
>> dimensions.
>> It is more accurate because it avoids the intersections.  It is also
>> possible to do a mix of
>> these two approaches, but you would need to think it through carefully.
>>
>> I hope this helps.
>>
>> Lee.
>>
>>
>>
>>
>>
>>
>> On Thu, Apr 28, 2022 at 1:35 AM vijay rajan <vijay.sankar.ra...@gmail.com>
>> wrote:
>>
>>> Hi,
>>> Just like theta sketches are used for distinct count metrics like
>>> impressions and clicks, is there a sketch (perhaps quantile?) that can be
>>> used for metrics like ad_spend? If so, what are the error bounds?
>>>
>>> There is a big opportunity that I see in storing very little data in
>>> sketches (which I use as a set) which makes retrieval of aggregate/analytic
>>> data very fast (although it is approximate).
>>>
>>> This question is best explained with an example. Say I have a fact table
>>> schema as follows
>>>
>>>
>>>    1. *Age_Groups* is a dimension with 10 bucketed distinct values
>>>    {less than 18, 19-25, 26-35....}
>>>    2. *Gender* is a dimension with 3 distinct values F, M & Unknown
>>>    3. *Country* is a dimension with 200 possible values
>>>    4. *Impressions* is a metric which is a long count (perhaps with
>>>    Theta sketch or HLL)
>>>    5. *Clicks* is a metric which is a long count (perhaps with Theta
>>>    sketch or HLL)
>>>    6. *Ad-Spend* is a metric which is a double which is a sum (*perhaps
>>>    with quantile sketch??*)
>>>
>>>
>>> The maximum possible number of entries in this table would be the cross
>>> product of the cardinalities of the dimensions which is 10(Age_Group) x
>>> 3(Gender) x 200(Country) = 6000. Now instead of storing 6000 records, I
>>> could store only (10 + 3 + 200) * *3 sketches (one each for
>>> impressions, clicks and Ad-Spend)* = 639 sketches and accomplish the
>>> group-by queries using the set operations that sketches provide.
>>>
>>> For the impression metric, one sketch for Gender=F, another sketch for
>>> Gender=M, yet another for Gender=Unknown and so on for other dimensions as
>>> well.
>>> For the click metric, one sketch for Gender=F, another sketch for Gender=M,
>>> yet another for Gender=Unknown and so on for other dimensions as well.
>>> Question is what sketch would I need for ad-spend?
>>>
>>> On a side note, I have used Theta sketches in this manner and even
>>> implemented ECLAT for frequent itemset computations and association rule
>>> mining ... an example from another project is below, with theta sketches
>>> used for counts. I do not know how to do the same for a metric like Ad-Spend.
>>>
>>> Level Conjunction Count
>>> 2 H10 & MemberGender=F 74
>>> 2 M15 & MemberGender=F 83
>>> 2 G31 & MemberGender=F 59
>>> 2 MemberGender=F & R13 66
>>> *In the example above, H10, M15 etc. are International Classification of
>>> Diseases (ICD) codes for specific diseases.*
>>> https://www.aapc.com/codes/code-search/
>>>
>>> Hope this is a clear representation of the problem.
>>>
>>> Regards
>>> Vijay
>>>
>>>
>>>
>>>
>>>
