Thanks for your inputs, Karl and Lee.

Regards,
kartik
On Fri, Sep 17, 2021 at 6:15 AM leerho <lee...@gmail.com> wrote:
> Kartik,
>
>> *Do you think this is a good model to solve Q2?*
>
> Your Q2 is in the domain of unique users. So, yes. And if you are using
> Druid to do effectively a "select and group-by" of the raw data used to
> feed the two sketches, then just using Theta Sketches is sufficient. The
> Tuple Sketches are not useful for you here.
>
> But I inferred from your original question that you wanted to take the
> results of the solution of Q2 and, using intersections, somehow solve your
> Q1, which is in the domain of ad impressions. You can't do that, as they
> are completely separate domains. And Karl is saying the same thing.
>
> Cheers,
> Lee.
>
> On Thu, Sep 16, 2021 at 1:48 AM Karl Matthias <k...@community.com> wrote:
>
>> Hi Kartik,
>>
>> I certainly don't have the expertise with this that Lee does, but
>> stepping back from your specific examples, to use a Theta sketch:
>>
>> 1. All of the sets/sketches you want to have interact together must
>>    contain the same domain values, be that User-ID or Impression-ID or
>>    something else.
>> 2. You can then intersect, disjoin, or union any number of them with
>>    the logic you want (e.g. join these two, AnotB a third) to get your
>>    result.
>> 3. Accuracy seems to be quite good for sketches that are within a
>>    couple of orders of magnitude of each other in size. From what I can
>>    tell, the only place you have trouble is with very large sketches
>>    interacting with very small ones. You have some control over accuracy
>>    at the tradeoff of size.
>>
>> You may be able to use Druid to do this, but joins in Druid are new
>> (2020) and probably not ideal for what you are suggesting. In Druid you
>> would need to put all of the dimensions into a single table, AFAICT. If
>> you have a smallish number of dimensions, this is probably fine. If you
>> have many, and which dimensions you care about changes frequently, this
>> might not work.
>> More about it here: https://imply.io/post/apache-druid-joins .
>> You might take a look at the datasketches-postgresql plugin.
>>
>> Hope it helps,
>> Karl
>>
>> On Thu, Sep 16, 2021 at 6:38 AM Kartik Mahajan <dhaki...@gmail.com> wrote:
>>
>>> Hi Lee,
>>>
>>> I am grateful to you for your inputs. Thank you so much. Let's focus on
>>> Q2, and let me explain what I meant by "So after roll-up, I would end
>>> up with 1 theta-sketch per dimension value per day (assuming day-level
>>> granularity) and I can do unions and intersections on these sketches to
>>> answer Q2".
>>>
>>> Let's say that my raw data records are of the following form (let's
>>> call this *Data-Model-1*):
>>>
>>> Date-Time-Stamp, Dim-name, Dim-value, UserID
>>>
>>> and let's say one of the Dim-names is "State". Then for all the raw
>>> records which have the same Date-Time-Stamp, Dim-name="State",
>>> Dim-value="California", and *any* UserID, I would end up storing only 1
>>> Theta Sketch (after roll-up) in Druid.
>>>
>>> So essentially one can visualize the record (after roll-up) as a
>>> key-value pair where the key is a tuple {Date-Time-Stamp, "State",
>>> "California"} and the value is a *set* of UserIDs (stored as a Theta
>>> Sketch).
>>>
>>> So if I fire a query Q2 ''' Find the count of unique users on a
>>> specific date where Dim-name="State" and Dim-value="California" and
>>> Dim-name="Gender" and Dim-value="Male" ''', internally (in Druid) it
>>> would take the intersection of two Theta Sketches: the one mentioned
>>> above, and the sketch for Gender "Male" on that specific date.
>>>
>>> Correct me if I am wrong in what I described above. (I've read the
>>> documentation and am aware that the intersection error rate is
>>> proportional to the square root of the inverse of the Jaccard index.)
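The roll-up model described above — one sketch per {Date-Time-Stamp, Dim-name, Dim-value} key, with an intersection answering Q2 — can be illustrated with a toy KMV ("k minimum values") implementation. To be clear, this is *not* the Apache DataSketches API; it is a minimal, self-contained sketch of the underlying theta-sketch technique, and all names and sizes below are made up for the example.

```python
# Toy KMV ("k minimum values") sketch -- a minimal illustration of the
# theta-sketch idea, NOT the Apache DataSketches API.
import hashlib

def _h(item: str) -> float:
    """Hash an item to a pseudo-uniform value in [0, 1)."""
    digest = hashlib.sha256(item.encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

class ToyThetaSketch:
    def __init__(self, k: int = 4096):
        self.k = k            # nominal entries
        self.theta = 1.0      # sampling threshold
        self.hashes = set()   # retained hashes, all < theta

    def update(self, item: str) -> None:
        x = _h(item)
        if x < self.theta:
            self.hashes.add(x)
            if len(self.hashes) > self.k:
                # Evict the largest retained hash; it becomes the new theta.
                self.theta = max(self.hashes)
                self.hashes.remove(self.theta)

    def estimate(self) -> float:
        """Estimated number of distinct items seen."""
        return len(self.hashes) / self.theta

def intersect_estimate(a: ToyThetaSketch, b: ToyThetaSketch) -> float:
    """Estimate |A intersect B| from the retained hashes of both sketches."""
    theta = min(a.theta, b.theta)
    common = {x for x in a.hashes & b.hashes if x < theta}
    return len(common) / theta

# One sketch per (date, dim-name, dim-value) key, as in the roll-up model.
california = ToyThetaSketch()   # users seen with State=California
male = ToyThetaSketch()         # users seen with Gender=Male
for uid in range(10_000):
    california.update(f"user-{uid}")
for uid in range(5_000, 15_000):
    male.update(f"user-{uid}")

# Q2: unique users where State=California AND Gender=Male (true answer: 5,000).
est = intersect_estimate(california, male)
```

With k = 4096 the single-sketch estimates should land within a few percent of the true 10,000, and the intersection within a few percent of 5,000; as the set-ops accuracy note referenced later in the thread warns, intersection accuracy degrades as the Jaccard index shrinks.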
>>> *Do you think this is a good model to solve Q2?*
>>>
>>> Now talking about Q1, I've no idea how I can count total impressions
>>> served *using the same Data-Model-1*. I too feel that Tuple Sketches
>>> could be useful, but they won't really fit into the *above-described*
>>> data model (I am open to suggestions). I'll read more about them,
>>> think, and then post some questions.
>>>
>>> One silly question I have is: if I replace UserID above with
>>> ImpressionID, will I get an extremely high error percentage (I believe
>>> so) when answering Q1?
>>>
>>> I've taken note of your comments (thank you :-)). Any inputs from you
>>> about a *different data model*, or any other inputs about how to solve
>>> Q1, would be helpful.
>>>
>>> Here is a list of ideas that I have:
>>>
>>> 1) Model my data with hundreds of columns (one per dimension), a
>>> typical denormalized OLAP table. Pros: most flexible. Cons: storage
>>> cost would be huge, as *roll-up would essentially be zero* and I would
>>> end up storing billions of raw data records (per day).
>>>
>>> 2) Use bitmaps (for representing sets), but since plain bitmaps would
>>> also be space-inefficient, use Roaring Bitmaps
>>> (http://roaringbitmap.org/).
>>>
>>> 3) Keep only a 5% sample of the raw records (random uniform sampling)
>>> and then extrapolate the query results on the sample by multiplying
>>> them by 20.
>>>
>>> I would like to note that the above 2 queries are only the initial set
>>> of queries that I found interesting; there would probably be at least 2
>>> dozen more queries that I would like to incorporate before I commit to
>>> a specific framework and data model, so any inputs based on your
>>> experience would be invaluable so that I don't end up re-inventing the
>>> wheel :-)
>>>
>>> Would love to hear your thoughts.
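Idea (3) above is easy to sanity-check numerically. The snippet below (all numbers are made up for the illustration) simulates a 5% uniform sample and the ×20 extrapolation. Note that this works well for *total* counts like Q1, but record-level sampling biases *distinct* counts like Q2, since rare users are likely to be dropped from the sample entirely.

```python
# Simulate idea (3): keep a 5% uniform random sample of raw impression
# records, then scale the sampled count back up by 20. Numbers illustrative.
import random

random.seed(42)
TRUE_IMPRESSIONS = 1_000_000
SAMPLE_RATE = 0.05

# Each raw record survives sampling independently with probability 5%.
sampled = sum(1 for _ in range(TRUE_IMPRESSIONS) if random.random() < SAMPLE_RATE)

estimate = sampled / SAMPLE_RATE   # i.e. multiply the sample count by 20
rel_error = abs(estimate - TRUE_IMPRESSIONS) / TRUE_IMPRESSIONS
```

For a count this large the relative error of the extrapolated total is a small fraction of a percent, which is why sampling is attractive for Q1-style queries even though it is risky for Q2-style unique counts.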
>>> Regards,
>>> kartik
>>>
>>> On Thu, Sep 16, 2021 at 4:16 AM leerho <lee...@gmail.com> wrote:
>>>
>>>> Hi Kartik,
>>>>
>>>> The problem you describe is typical for on-line advertising and
>>>> similar to ones we have worked on before. Solving this problem with
>>>> sketches will provide approximate results in near-real time. However,
>>>> doing so even with sketches may require considerably more resources
>>>> than you may be planning. When you said "we don't have a data
>>>> warehouse and ad-hoc OLAP", I am hoping you are not thinking that you
>>>> can solve this problem at the scale that you described with just a
>>>> few machines. Nonetheless, if you attempted to solve this problem
>>>> brute-force to produce exact answers, the resources required would be
>>>> far larger and the query latency would be orders of magnitude longer.
>>>>
>>>> Your queries Q1 and Q2 are both similar in that you are seeking a
>>>> unique count of some identifier, either UserID or AdImpressionID,
>>>> qualified by dimensions. However, they are very different in that the
>>>> domain of users is completely unrelated to the domain of ad
>>>> impressions. You cannot take a sketch from the user domain and merge
>>>> it (Union, Intersection, Difference) with a sketch from the ad
>>>> impression domain. Your statement "So after roll-up, I would end up
>>>> with 1 theta-sketch per dimension value per day (assuming day level
>>>> granularity) and I can do unions and intersections on these sketches
>>>> to answer Q2" makes no sense to me. Your raw data records need to be
>>>> much richer and something similar to the form (but likely much more
>>>> complex):
>>>>
>>>> - Date-Time-Stamp, Dim-name, Dim-value, UserID, AdImpressionID
>>>>
>>>> Presumably, your AdImpressionID can be correlated back to another
>>>> table that provides richer dimensional information about the ad
>>>> impression, but at least this record would allow correlation of
>>>> unique users to ad impressions.
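Lee's richer record form can be illustrated with an exact (non-sketch) toy that shows why a record linking both IDs lets the same roll-up answer Q1 and Q2. In a Tuple sketch, the per-user impression count below would play the role of the "summary" attached to each retained key; the records and names here are made up for the example.

```python
# Exact toy roll-up over records of Lee's richer form:
# (date, dim_name, dim_value, user_id, ad_impression_id).
# Illustrative data only -- not the DataSketches Tuple sketch API.
from collections import defaultdict

records = [
    ("2021-09-12", "State",  "California", "u1", "imp-1"),
    ("2021-09-12", "State",  "California", "u1", "imp-2"),
    ("2021-09-12", "State",  "California", "u2", "imp-3"),
    ("2021-09-12", "Gender", "Male",       "u1", "imp-1"),
    ("2021-09-12", "Gender", "Male",       "u1", "imp-2"),
]

# Roll up to: key -> {user_id -> impression count}. A Tuple sketch would keep
# an approximate sample of the user keys, each carrying such a count summary.
rollup: dict = defaultdict(lambda: defaultdict(int))
for date, dim, val, user, imp in records:
    rollup[(date, dim, val)][user] += 1

ca = rollup[("2021-09-12", "State", "California")]
q2_unique_users = len(ca)          # Q2: distinct users for this key
q1_impressions = sum(ca.values())  # Q1: total impressions for this key
```

The point of the richer record is visible here: both query families read off one rolled-up structure, whereas a UserID-only record (Data-Model-1) loses the impression counts at roll-up time.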
>>>> The Theta sketch family provides approximate intersections and
>>>> differences, but only against the originating domain of the data fed
>>>> to the sketch. You can intersect unique users of one sector with
>>>> unique users of another sector. But you can't logically intersect a
>>>> sketch of unique users with a sketch of unique ad impressions, as the
>>>> result is totally meaningless.
>>>>
>>>> It sounds like what you are asking for is a full SQL-type Join (which
>>>> is a type of intersection).
>>>>
>>>> We don't have a sketch that directly addresses the full Join, but we
>>>> have the Tuple sketch, which is derived from the Theta sketch and may
>>>> be able to help you. I would encourage you to read up on the Tuple
>>>> Sketch on our website, and come back with more questions if it seems
>>>> that the Tuple Sketch might be of interest.
>>>>
>>>> ***
>>>> Other comments to think about:
>>>>
>>>> - You stated that your time granularity was a "day" and that you had
>>>>   users in countries all over the world. I'm sure you are aware that
>>>>   the definition of a "day" varies all over the world, with about 40
>>>>   or so time zones. You might find it more useful to define your time
>>>>   granularity to be an hour, for example, and create rotating time
>>>>   windows; then, when producing queries for a particular time zone,
>>>>   simply merge the 24 sketches for that time zone.
>>>> - With respect to Druid, we have been using sketches with Druid for
>>>>   many years very successfully. There are other systems with
>>>>   DataSketches integration as well, including PostgreSQL, Thinking
>>>>   Machine, and Permutive, and a few more that are in the process of
>>>>   integrating with DataSketches, such as Impala and Pinot.
>>>>
>>>> Cheers,
>>>> Lee.
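Lee's hourly-granularity suggestion can be sketched with exact sets for clarity: bucket by UTC hour, then answer a "day" for any time zone by merging the 24 hourly buckets that cover it. In practice each bucket would be a Theta sketch and the `|=` below would be a sketch Union; all data here is made up for the illustration.

```python
# Hypothetical hourly buckets: UTC hour index -> set of user IDs seen then.
# (In production: one Theta sketch per UTC hour instead of an exact set.)
hourly_users = {h: {f"user-{h}-{i}" for i in range(10)} for h in range(48)}

def unique_users_for_local_day(start_utc_hour: int) -> int:
    """Merge the 24 hourly buckets that make up one local-time 'day'."""
    day: set = set()
    for h in range(start_utc_hour, start_utc_hour + 24):
        day |= hourly_users.get(h, set())   # sketch Union in the real system
    return len(day)

# A "day" in a UTC-8 time zone covers UTC hours 8..31.
pst_day_uniques = unique_users_for_local_day(8)
```

Because unions are cheap and lossless for Theta sketches, storing hourly sketches costs little accuracy while letting one dataset serve every time zone.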
>>>> On Tue, Sep 14, 2021 at 10:09 PM Kartik Mahajan <dhaki...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> We're currently serving 100 billion ad impressions per day across
>>>>> our 6 data centers. Of these, we are serving about 80 billion ad
>>>>> impressions in the US alone.
>>>>> Each ad impression can have hundreds of attributes (dimensions),
>>>>> e.g. Country, City, Browser, OS, Custom-Parameters from the web
>>>>> page, ad-size, ad-id, site-id, etc.
>>>>>
>>>>> Currently, we don't have a data warehouse, and ad-hoc OLAP support
>>>>> is pretty much non-existent in our organization. This severely
>>>>> limits our ability to run ad-hoc queries and get a quick grasp of
>>>>> the data.
>>>>>
>>>>> We want to answer the following 2 queries to begin with:
>>>>>
>>>>> Q1) Find the total count of ad impressions which were served from
>>>>> "beginDate" to "endDate" where Dimension1 = d1 and Dimension2 = d2
>>>>> ... and Dimensionk = d_k
>>>>>
>>>>> Q2) Find the total count of unique users who saw our ads from
>>>>> "beginDate" to "endDate" where Dimension1 = d1 and/or Dimension2 =
>>>>> d2 ... and/or Dimensionk = d_k
>>>>>
>>>>> As I said, each impression can have hundreds of dimensions (listed
>>>>> above), and the cardinality of each dimension could range from a few
>>>>> hundred (say for dimension Country) to billions (e.g. User-id).
>>>>>
>>>>> We want approximate answers, the least infrastructure cost, and
>>>>> query response times of < 5 minutes. I am thinking about using Druid
>>>>> and Apache DataSketches <https://datasketches.apache.org/> (the
>>>>> Theta Sketch, to be precise) for answering Q2, using the following
>>>>> data model:
>>>>>
>>>>> Date        Dimension Name  Dimension Value  Unique-User-ID (Theta sketch)
>>>>> 2021/09/12  "Country"       "US"             37873-3udif-83748-2973483
>>>>> 2021/09/12  "Browser"       "Chrome"         37873-3aeuf-83748-2973483
>>>>> ... <other records>
>>>>>
>>>>> So after roll-up, I would end up with 1 theta-sketch per dimension
>>>>> value per day (assuming day-level granularity), and I can do unions
>>>>> and intersections on these sketches to answer Q2.
>>>>>
>>>>> I am planning to set k (nominal entries) to 10^5 (please comment on
>>>>> what would be a suitable k for this use case and the expected amount
>>>>> of storage required).
>>>>>
>>>>> I've also read about Theta sketch set-ops accuracy here
>>>>> <https://datasketches.apache.org/docs/Theta/ThetaSketchSetOpsAccuracy.html>.
>>>>>
>>>>> I would like to know if there is a better approach to solve Q2 (with
>>>>> or without Druid).
>>>>>
>>>>> Also, I would like to know how I can solve Q1.
>>>>>
>>>>> If I replace Unique-User-Id with "Impression-Id", can I use the same
>>>>> data model to answer Q1? I believe that if I replace Unique-User-Id
>>>>> with "Impression-Id", the accuracy of counting total impressions
>>>>> would be much worse than that of Q2, because each ad impression is
>>>>> assigned a unique id and we are currently serving 250 billion per
>>>>> day.
>>>>>
>>>>> Please share your thoughts about solving Q1 and Q2.
>>>>>
>>>>> Regards,
>>>>> kartik
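For the k = 10^5 question at the end of the thread, a back-of-envelope answer: the Theta sketch documentation quotes a relative standard error of roughly 1/√k for single-sketch estimates, and retained hashes cost about 8 bytes each. The exact constants depend on the sketch family and configuration, so treat these as order-of-magnitude figures, not the DataSketches sizing formulas.

```python
# Rough accuracy and storage estimates for k = 10^5 nominal entries,
# assuming RSE ~ 1/sqrt(k) and ~8 bytes per retained hash (approximations).
import math

k = 10**5
rse = 1 / math.sqrt(k)            # ~0.32% relative standard error per sketch

# An updatable (QuickSelect-style) sketch may retain up to about 2k hashes
# before trimming; a compact stored sketch holds about k hashes.
max_update_bytes = 2 * k * 8      # ~1.6 MB per sketch while updating
compact_bytes = k * 8             # ~0.8 MB per stored compact sketch
```

At one sketch per (date, dim-name, dim-value) key, total storage is then roughly 0.8 MB times the number of distinct dimension values per day, which is worth checking against the hundreds of dimensions described above. Note also that this RSE applies to single-sketch estimates; as discussed earlier in the thread, intersection error grows with the inverse of the Jaccard index.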