Thanks for your inputs, Karl and Lee.

Regards
kartik

On Fri, Sep 17, 2021 at 6:15 AM leerho <lee...@gmail.com> wrote:

> Kartik,
>>
>> *Do you think this is a good model to solve Q2?*
>
> Your Q2 is in the domain of unique users, so yes. And if you are using
> Druid to do, effectively, a "select and group-by" of the raw data used to
> feed the two sketches, then Theta Sketches alone are sufficient. Tuple
> Sketches are not useful for you here.
>
> But I inferred from your original question that you wanted to take the
> results of your Q2 solution and, using intersections, somehow solve your
> Q1, which is in the domain of ad impressions. You can't do that, as they
> are completely separate domains. And Karl is saying the same thing.
>
> Cheers,
> Lee.
>
> On Thu, Sep 16, 2021 at 1:48 AM Karl Matthias <k...@community.com> wrote:
>
>> Hi Kartik,
>>
>> I certainly don't have the expertise with this that Lee does, but
>> stepping back from your specific examples, to use a Theta sketch:
>>
>>    1. All of the sets/sketches you want to have interact together must
>>    contain values from the same domain, be that User-ID or Impression-ID
>>    or something else.
>>    2. You can then intersect, union, or difference (AnotB) any number of
>>    them with the logic you want (e.g. union these two, AnotB a third) to
>>    get your result.
>>    3. Accuracy seems to be quite good for sketches that are within a
>>    couple of orders of magnitude of each other in size. From what I can
>>    tell, the only place you run into trouble is with very large sketches
>>    interacting with very small ones. You have some control over accuracy
>>    at the tradeoff of size.
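The set operations in the list above can be made concrete with a toy KMV-style sketch in Python. This is a simplified stand-in for a real Theta sketch, not the DataSketches library's actual implementation or API; the hashing scheme and all names here are illustrative:

```python
import hashlib

def h64(item):
    # Map an item to a pseudo-random value in (0, 1); illustrative only,
    # not the hashing scheme the DataSketches library actually uses.
    digest = hashlib.sha256(str(item).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

class ThetaSketch:
    """Toy KMV-style theta sketch: retain hash values below theta, at most k."""
    def __init__(self, k=4096, hashes=None, theta=1.0):
        self.k = k
        self.hashes = set(hashes or [])
        self.theta = theta

    def update(self, item):
        x = h64(item)
        if x < self.theta:
            self.hashes.add(x)
            if len(self.hashes) > self.k:
                evicted = max(self.hashes)   # drop the largest retained hash
                self.hashes.remove(evicted)
                self.theta = evicted         # shrink the sampling threshold

    def estimate(self):
        # Each retained hash represents 1/theta items of the full stream
        return len(self.hashes) / self.theta

def _combine(theta, hashes, k):
    kept = sorted(h for h in hashes if h < theta)
    if len(kept) > k:
        theta, kept = kept[k], kept[:k]
    return ThetaSketch(k, kept, theta)

def union(a, b):
    return _combine(min(a.theta, b.theta), a.hashes | b.hashes, min(a.k, b.k))

def intersect(a, b):
    return _combine(min(a.theta, b.theta), a.hashes & b.hashes, min(a.k, b.k))

def a_not_b(a, b):
    return _combine(min(a.theta, b.theta), a.hashes - b.hashes, a.k)

ca = ThetaSketch()            # users who saw an ad in California
for i in range(1000):
    ca.update(f"user-{i}")
male = ThetaSketch()          # male users; overlaps users 500..999
for i in range(500, 1500):
    male.update(f"user-{i}")
both = intersect(ca, male)    # unique users matching both dimensions
```

Because the two sketches here hold fewer than k entries, the estimates are exact; at scale they become approximate, with the large-vs-small accuracy caveat from point 3 above.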
>>
>> You may be able to use Druid to do this, but joins in Druid are new
>> (2020) and probably not ideal for what you are suggesting. In Druid you
>> would need to put all of the dimensions into a single table AFAICT. If you
>> have a smallish number of dimensions this is probably fine. If you have
>> many, and which dimensions you care about changes frequently, this might
>> not work. More about it here: https://imply.io/post/apache-druid-joins .
>> You might take a look at the datasketches-postgresql plugin.
>>
>> Hope it helps
>> Karl
>>
>>
>>
>> On Thu, Sep 16, 2021 at 6:38 AM Kartik Mahajan <dhaki...@gmail.com>
>> wrote:
>>
>>> Hi Lee,
>>>
>>> I am grateful for your inputs. Thank you so much. Let's focus on Q2,
>>> and let me explain what I meant by "So after roll-up, I would end up
>>> with 1 theta-sketch per dimension value per day (assuming day-level
>>> granularity) and I can do unions and intersections on these sketches to
>>> answer Q2".
>>>
>>> Let's say that my raw data records are of the following form {let's call
>>> this *Data-Model-1*}:
>>>
>>>          Date-Time-Stamp, Dim-name, Dim-value, UserID
>>>
>>>
>>> and let's say one of the Dim-names is "State". Then, for all the raw
>>> records which have the same Date-Time-Stamp, Dim-name="State",
>>> Dim-value="California", and *any* User-ID, I would end up storing only 1
>>> Theta Sketch (after roll-up) in Druid.
>>>
>>> So essentially one can visualize the record (after roll-up) as a
>>> key-value pair where the key is a tuple {Date-Time-Stamp, "State",
>>> "California"} and the value is a *set* of User-IDs (stored as a Theta
>>> Sketch).
>>>
>>> So if I fire a query Q2 ''' Find the count of unique users on a specific
>>> date where Dim-name="State" and Dim-value="California" and
>>> Dim-name="Gender" and Dim-value="Male" ''', internally (in Druid) it
>>> would take the intersection of two Theta Sketches: the one mentioned
>>> above, and the sketch for Gender "Male" on that specific date.
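The roll-up model and the Q2 intersection described above can be prototyped in a few lines of Python, using exact sets as a stand-in for the theta sketches (in the real system each set would be a much smaller sketch, and `&` would be a sketch Intersection); the sample records are made up:

```python
from collections import defaultdict

# Key: (date, dim_name, dim_value) -> set of user IDs seen for that cell.
# In production each set would be a (much smaller) theta sketch.
rollup = defaultdict(set)

raw_records = [  # (Date-Time-Stamp, Dim-name, Dim-value, UserID)
    ("2021/09/12", "State",  "California", "u1"),
    ("2021/09/12", "State",  "California", "u2"),
    ("2021/09/12", "Gender", "Male",       "u1"),
    ("2021/09/12", "Gender", "Male",       "u3"),
]
for date, dim, val, user in raw_records:
    rollup[(date, dim, val)].add(user)

# Q2: unique users on 2021/09/12 with State=California AND Gender=Male.
# With sketches, '&' becomes a theta-sketch Intersection.
q2 = (rollup[("2021/09/12", "State", "California")]
      & rollup[("2021/09/12", "Gender", "Male")])
q2_count = len(q2)  # only "u1" matches both dimensions
```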
>>>
>>> Correct me if I am wrong in what I described above. (I've read the
>>> documentation and am aware that intersection error rate is proportional to
>>> the square root of the inverse of Jaccard index).
>>> *Do you think this is a good model to solve Q2?*
>>>
>>> Now talking about Q1, I've no idea how I can count total impressions
>>> served *using the same Data-Model-1*. I feel that Tuple Sketches could
>>> be useful but won't really fit into the *above-described* data model (I
>>> am open to suggestions). I will read more about them, think, and then
>>> post some questions.
>>>
>>> One silly question I have: if I replace UserID above with ImpressionID,
>>> will I get an extremely high error percentage (I believe so) when
>>> answering Q1?
>>>
>>> I've taken note of your comments (thank you :-)). Any inputs from you
>>> about a *different data model*, or any other inputs about how to solve
>>> Q1, would be helpful.
>>>
>>> Here is a list of ideas that I have:
>>>
>>> 1) Model my data with hundreds of columns (one per dimension), as a
>>> typical denormalized OLAP table. Pros: most flexible. Cons: storage cost
>>> would be huge, as *roll-up would essentially be zero* and I would end up
>>> storing billions of raw data records (per day).
>>>
>>> 2) Use bitmaps for representing sets, but since plain bitmaps would also
>>> be space-inefficient, use Roaring Bitmaps (http://roaringbitmap.org/).
>>>
>>> 3) Keep only a 5% sample of the raw records (random uniform sampling)
>>> and then extrapolate the query results from the sample by multiplying
>>> them by 20.
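Idea 3 can be prototyped in a few lines. Hashing the User-ID (rather than sampling events at random) keeps the same users in the sample across all queries, which matters for unique-user counts. The 5% rate and the synthetic traffic below are illustrative:

```python
import hashlib
import random

def in_sample(user_id, rate=0.05):
    # Deterministic, hash-based membership: a given user is always either
    # in or out of the sample, across all days and all queries.
    h = int.from_bytes(hashlib.sha256(user_id.encode()).digest()[:8], "big")
    return h / 2**64 < rate

random.seed(7)  # synthetic traffic: 1M impressions over 100k users
events = [f"user-{random.randrange(100_000)}" for _ in range(1_000_000)]

sampled_impressions = sum(1 for e in events if in_sample(e))
estimated_impressions = sampled_impressions * 20  # extrapolate the 5% sample

sampled_users = {e for e in events if in_sample(e)}
estimated_uniques = len(sampled_users) * 20
```

The tradeoff, of course, is that rare slices (a low-traffic dimension value) may have few or no sampled records, so per-slice error can be much worse than the global 5% figure suggests.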
>>>
>>> I would like to note that the above 2 queries are only the initial set
>>> of queries I found interesting; there will probably be at least two
>>> dozen more that I would like to incorporate before I commit to a
>>> specific framework and data model. So any inputs based on your
>>> experience would be invaluable, so that I don't end up re-inventing the
>>> wheel :-)
>>>
>>> Would love to hear your thoughts.
>>>
>>> Regards
>>> kartik
>>>
>>>
>>> On Thu, Sep 16, 2021 at 4:16 AM leerho <lee...@gmail.com> wrote:
>>>
>>>> Hi Kartik,
>>>> The problem you describe is typical for on-line advertising and similar
>>>> to ones we have worked on before. Solving this problem with sketches
>>>> will provide approximate results in near-real time. However, even with
>>>> sketches, doing so may require considerably more resources than you are
>>>> planning. When you said "we don't have a data warehouse and ad-hoc
>>>> OLAP", I am hoping you are not thinking that you can solve this problem
>>>> at the scale you described with just a few machines. Nonetheless, if
>>>> you attempted to solve this problem brute-force to produce exact
>>>> answers, the resources required would be far larger and the query
>>>> latency would be orders-of-magnitude longer.
>>>>
>>>> Your queries Q1 and Q2 are similar in that both seek a unique count of
>>>> some identifier, either UserID or AdImpressionID, qualified by
>>>> dimensions. However, they are very different in that the domain of
>>>> users is completely unrelated to the domain of ad impressions. You
>>>> cannot take a sketch from the user domain and merge it (Union,
>>>> Intersection, Difference) with a sketch from the ad-impression domain.
>>>> Your statement, "So after roll-up, I would end up with 1 theta-sketch
>>>> per dimension value per day (assuming day level granularity) and I can
>>>> do unions and intersections on these sketches to answer Q2", makes no
>>>> sense to me. Your raw data records need to be much richer, something
>>>> similar to the following form (but likely much more complex):
>>>>
>>>>    - Date-Time-Stamp, Dim-name, Dim-value, UserID, AdImpressionID
>>>>
>>>> Presumably, your AdImpressionID can be correlated back to another table
>>>> that provides richer dimensional information about the Ad impression, but
>>>> at least this record would allow correlation of unique users to ad
>>>> impressions.
>>>>
>>>> The Theta sketch family provides approximate intersections and
>>>> differences, but only within the originating domain of the data fed to
>>>> the sketches. You can intersect unique users of one sector with unique
>>>> users of another sector. But you can't logically intersect a sketch of
>>>> unique users with a sketch of unique ad impressions, as the result is
>>>> totally meaningless.
>>>>
>>>> It sounds like what you are asking for is a full SQL-type Join (which
>>>> is a type of intersection).
>>>>
>>>> We don't have a sketch that directly addresses the full Join, but we
>>>> have the Tuple sketch, which is derived from the Theta sketch, that may be
>>>> able to help you.  I would encourage you to read up on the Tuple Sketch on
>>>> our website, and come back with more questions if it seems that the Tuple
>>>> Sketch might be of interest.
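As a rough illustration of the Tuple Sketch idea (a toy, not the library's actual API): extend a theta-style sample of unique users so that each retained user carries a summary value, here an impression counter. The same structure then yields both a unique-user estimate and a total-impressions estimate. This toy discards the summaries of evicted users, which is a simplification of what the real sketch does:

```python
import hashlib

def h64(item):
    # Illustrative hash to (0, 1); not the library's actual hashing scheme
    digest = hashlib.sha256(str(item).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

class TupleSketchToy:
    """Theta-style sample of unique users, each retained user carrying a
    summary value (here: how many impressions that user was served)."""
    def __init__(self, k=4096):
        self.k = k
        self.theta = 1.0
        self.entries = {}  # user hash -> accumulated impression count

    def update(self, user_id, impressions=1):
        x = h64(user_id)
        if x < self.theta:
            self.entries[x] = self.entries.get(x, 0) + impressions
            if len(self.entries) > self.k:
                evicted = max(self.entries)  # drop largest hash, shrink theta
                del self.entries[evicted]
                self.theta = evicted

    def unique_users(self):
        return len(self.entries) / self.theta

    def total_impressions(self):
        # Each retained user stands for 1/theta users of the population
        return sum(self.entries.values()) / self.theta

sk = TupleSketchToy()
for i in range(1000):
    sk.update(f"user-{i}", impressions=3)  # every user saw 3 ads
```

Note this estimates impressions *attributable to the sampled user population*, which is exactly the cross-domain linkage Lee describes: the record must carry both the UserID and the impression information for this to work.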
>>>>
>>>> ***
>>>> Other comments to think about:
>>>>
>>>>    - You stated that your time granularity was a "day" and that you
>>>>    had users in countries all over the world. I'm sure you are aware
>>>>    that the definition of a "day" varies all over the world, with about
>>>>    40 or so time-zones. You might find it more useful to define your
>>>>    time granularity to be an hour, for example, and create rotating
>>>>    time windows; then, when producing queries for a particular
>>>>    time-zone, simply merge the 24 hourly sketches for that time-zone.
>>>>    - With respect to Druid, we have been using sketches with Druid
>>>>    very successfully for many years. There are other systems with
>>>>    DataSketches integration as well, including PostgreSQL, Thinking
>>>>    Machine, Permutive, and a few more that are in the process of
>>>>    integrating with DataSketches, such as Impala and Pinot.
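The hourly-window suggestion in the first bullet can be sketched as follows, again with plain Python sets standing in for theta sketches (each `|=` would be a sketch Union in the real system); the traffic is synthetic:

```python
from collections import defaultdict

# One "sketch" (here: a set) per UTC hour instead of one per day.
hourly = defaultdict(set)
for hour in range(48):                 # two UTC days of synthetic traffic
    hourly[hour].add(f"user-{hour % 30}")

def unique_users_local_day(start_utc_hour):
    """A local-time day is the union of its 24 hourly windows."""
    day = set()
    for h in range(start_utc_hour, start_utc_hour + 24):
        day |= hourly[h]               # with real sketches: a Union op
    return len(day)

# A calendar day in UTC-8 spans UTC hours 8..31 of these two UTC days
pacific_day_uniques = unique_users_local_day(8)
```

Merging 24 hourly sketches costs one Union pass per query, and unions preserve the theta sketch's accuracy well, so the per-hour layout mainly costs extra storage (24x the sketch count), not accuracy.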
>>>>
>>>> Cheers,
>>>> Lee.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Sep 14, 2021 at 10:09 PM Kartik Mahajan <dhaki...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> We're currently serving 100 billion ad impressions per day across our
>>>>> 6 data centers. Of these, about 80 billion ad impressions are served
>>>>> in the US alone.
>>>>> Each ad impression can have hundreds of attributes (dimensions), e.g.
>>>>> Country, City, Browser, OS, custom parameters from the web page,
>>>>> ad-size, ad-id, site-id, etc.
>>>>>
>>>>> Currently, we don't have a data warehouse, and ad-hoc OLAP support is
>>>>> pretty much non-existent in our organization. This severely limits our
>>>>> ability to run ad-hoc queries and get a quick grasp of the data.
>>>>>
>>>>> We want to answer the following 2 queries to begin with:
>>>>>
>>>>> Q1) Find the total count of ad impressions which were served from
>>>>> "beginDate" to "endDate" where Dimension1 = d1 and Dimension2 = d2
>>>>> ... and Dimensionk = d_k
>>>>>
>>>>> Q2) Find the total count of unique users who saw our ads from
>>>>> "beginDate" to "endDate" where Dimension1 = d1 and/or Dimension2 = d2
>>>>> ... and/or Dimensionk = d_k
>>>>>
>>>>> As I said, each impression can have hundreds of dimensions (listed
>>>>> above), and the cardinality of each dimension could range from a few
>>>>> hundred (say, for the Country dimension) to billions (e.g. User-ID).
>>>>>
>>>>> We want approximate answers, the least infrastructure cost, and query
>>>>> response times under 5 minutes. I am thinking about using Druid and
>>>>> Apache DataSketches <https://datasketches.apache.org/> (the Theta
>>>>> Sketch, to be precise) for answering Q2, using the following data
>>>>> model:
>>>>>
>>>>> Date         Dimension-Name   Dimension-Value   Unique-User-IDs (Theta sketch)
>>>>> 2021/09/12   "Country"        "US"              37873-3udif-83748-2973483
>>>>> 2021/09/12   "Browser"        "Chrome"          37873-3aeuf-83748-2973483
>>>>> .
>>>>> . <Other records>
>>>>>
>>>>> So after roll-up, I would end up with 1 theta-sketch per dimension
>>>>> value per day (assuming day-level granularity), and I can do unions
>>>>> and intersections on these sketches to answer Q2.
>>>>>
>>>>> I am planning to set k (nominal entries) to 10^5. (Please comment on
>>>>> what would be a suitable k for this use case, and on the expected
>>>>> amount of storage required.)
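A back-of-envelope answer to the storage part of that question, assuming roughly 8 bytes per retained entry in a compact theta sketch (real sketches add a small header, and sketches that never fill up are proportionally smaller). The dimension counts below are made-up placeholders, not figures from this thread:

```python
k = 100_000            # nominal entries per sketch
bytes_per_entry = 8    # ~one 64-bit hash per retained entry (compact form)
full_sketch_bytes = k * bytes_per_entry      # ~800 KB for a *full* sketch

# Hypothetical rollup width: 200 dimensions x 1,000 values each, per day
sketches_per_day = 200 * 1_000
worst_case_daily_bytes = sketches_per_day * full_sketch_bytes  # ~160 GB/day
```

For context, a theta sketch's relative standard error is on the order of 1/sqrt(k), i.e. roughly 0.3% at k = 10^5, so a smaller k may be an acceptable trade if storage dominates.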
>>>>>
>>>>> I've also read about Theta sketch set-operations accuracy here:
>>>>> <https://datasketches.apache.org/docs/Theta/ThetaSketchSetOpsAccuracy.html>
>>>>>
>>>>> I would like to know if there is a better approach to solving Q2 (with
>>>>> or without Druid).
>>>>>
>>>>> Also, I would like to know how I can solve Q1.
>>>>>
>>>>> If I replace Unique-User-ID with "Impression-ID", can I use the same
>>>>> data model to answer Q1? I believe that if I do so, the accuracy of
>>>>> counting total impressions would be far worse than that of Q2, because
>>>>> each ad impression is assigned a unique ID and we are currently
>>>>> serving 250 billion per day.
>>>>>
>>>>> Please share your thoughts about solving Q1 and Q2.
>>>>>
>>>>> Regards
>>>>> kartik
>>>>>
>>>>
