Hi Lee,

Thank you so much for your inputs; I am grateful for them. Let's focus on Q2, and let me explain what I meant by "So after roll-up, I would end up with 1 theta-sketch per dimension value per day (assuming day-level granularity) and I can do unions and intersections on these sketches to answer Q2".
Let's say that my raw data records have the following form (let's call this *Data-Model-1*): Date-Time-Stamp, Dim-name, Dim-value, UserID. Suppose one of the Dim-names is "State". Then for all the raw records that share the same Date-Time-Stamp, Dim-name="State", and Dim-value="California", with *any* UserID, I would end up storing only 1 Theta Sketch (after roll-up) in Druid. So essentially one can visualize the record (after roll-up) as a key-value pair where the key is the tuple {Date-Time-Stamp, "State", "California"} and the value is the *set* of UserIDs (stored as a Theta Sketch).

So if I fire query Q2: ''' Find the count of unique users on a specific date where Dim-name="State" and Dim-value="California" and Dim-name="Gender" and Dim-value="Male" ''', internally Druid would take the intersection of two Theta Sketches: the California sketch mentioned above, and the Theta Sketch for Gender "Male" on that specific date. Please correct me if I am wrong in what I described above. (I've read the documentation and am aware that the intersection error rate is proportional to the square root of the inverse of the Jaccard index.) *Do you think this is a good model to solve Q2?*

Now, talking about Q1: I have no idea how I can count total impressions served *using the same Data-Model-1*. I also feel that Tuple Sketches could be useful, but they won't really fit into the *above-described* data model (I am open to suggestions). I will read more about them, think, and then post some questions. One silly question I have: if I replace UserID above with ImpressionID, will I get an extremely high error percentage (I believe so) when answering Q1? I've taken note of your comments (thank you :-)). Any inputs from you about a *different data model*, or any other inputs about how to solve Q1, would be helpful.

Here is a list of ideas I have:

1) Model my data with hundreds of columns (one per dimension), i.e. a typical denormalized OLAP table.
Pros: Most flexible.
Cons: Storage cost would be huge, as *roll-up would essentially be zero* and I would end up storing billions of raw data records per day.

2) Use bitmaps to represent the sets; since plain bitmaps would also be space-inefficient, use Roaring Bitmaps (http://roaringbitmap.org/).

3) Keep only a 5% sample of the raw records (uniform random sampling) and extrapolate query results on the sample by multiplying them by 20.

I would like to note that the above 2 queries are only the initial set of queries that I found interesting; there will probably be at least two dozen more queries that I would like to incorporate before I commit to a specific framework and data model, so any inputs based on your experience would be invaluable so that I don't end up re-inventing the wheel :-)

Would love to hear your thoughts.

Regards
kartik

On Thu, Sep 16, 2021 at 4:16 AM leerho <lee...@gmail.com> wrote:

> Hi Karik,
> The problem you describe is typical for on-line advertising and similar to
> ones we have worked on before. Solving this problem with sketches will
> provide approximate results in near-real time. However, doing so even with
> sketches may require considerably more resources than you may be planning.
> When you said "we don’t have a data warehouse and ad-hoc OLAP", I am hoping
> you are not thinking that you can solve this problem at the scale that you
> described with just a few machines. Nonetheless, if you attempted to solve
> this problem brute-force to produce exact answers, the resources required
> would be far larger and the query latency would be orders-of-magnitude
> longer.
>
> Your queries Q1 and Q2 are both similar in that you are seeking a unique
> count of some identifier, either UserID or AdImpressionID, qualified by
> dimensions. However they are very different in that the domain of users is
> completely unrelated to the domain of ad impressions.
> You cannot take a
> sketch from the user domain and merge (Union, Intersection, Difference)
> with a sketch from the ad impression domain. Your statement: "So after
> roll-up, I would end up with 1 theta-sketch per dimension value per
> day (assuming day level granularity) and I can do unions and intersections
> on these sketches to answer Q2" makes no sense to me. Your raw data
> records need to be much richer and something similar to the form (but
> likely much more complex):
>
> - Date-Time-Stamp, Dim-name, Dim-value, UserID, AdImpressionID
>
> Presumably, your AdImpressionID can be correlated back to another table
> that provides richer dimensional information about the Ad impression, but
> at least this record would allow correlation of unique users to ad
> impressions.
>
> The Theta sketch family provides approximate intersections and
> differences, but only against the originating domain of the data fed to the
> sketch. You can intersect unique users of one sector with unique users of
> another sector. But you can't logically intersect a sketch of unique users
> with a sketch of unique ad impressions as the result is totally meaningless.
>
> It sounds like what you are asking for is a full SQL-type Join (which is a
> type of intersection).
>
> We don't have a sketch that directly addresses the full Join, but we have
> the Tuple sketch, which is derived from the Theta sketch, that may be able
> to help you. I would encourage you to read up on the Tuple Sketch on our
> website, and come back with more questions if it seems that the Tuple
> Sketch might be of interest.
>
> ***
> Other comments to think about:
>
> - You stated that your time granularity was a "day" and that you had
> users in countries all over the world. I'm sure you are aware that the
> definition of a "day" varies all over the world with about 40 or so
> time-zones.
> You might find it more useful to define your time granularity
> to be an hour, for example, and create rotating time windows; then, when
> producing queries for a particular time-zone, simply merge the 24 sketches
> for that time zone.
> - With respect to Druid, we have been using sketches with Druid for
> many years very successfully. There are other systems with DataSketches
> integration as well, including PostgreSQL, Thinking Machine, Permutive, and
> a few more that are in the process of integration with DataSketches such as
> Impala and Pinot.
>
> Cheers,
> Lee.
>
> On Tue, Sep 14, 2021 at 10:09 PM Kartik Mahajan <dhaki...@gmail.com> wrote:
>
>> Hi,
>>
>> We’re currently serving 100 Billion ad impressions per day across our 6
>> data centers. Out of these, we are serving about 80 Billion ad impressions
>> in the US alone.
>> Each ad impression can have hundreds of attributes (dimensions), e.g.
>> Country, City, Browser, OS, Custom-Parameters from the web-page, ad-size,
>> ad-id, site-id, etc.
>>
>> Currently, we don’t have a data warehouse, and ad-hoc OLAP support is
>> pretty much non-existent in our organization. This severely limits our
>> ability to run ad-hoc queries and get a quick grasp of the data.
>>
>> We want to answer the following 2 queries to begin with:
>>
>> Q1) Find the total count of ad impressions which were served from
>> "beginDate" to "endDate" where Dimension1 = d1 and Dimension2 = d2 ...
>> Dimensionk = d_k
>>
>> Q2) Find the total count of unique users which saw our ads from
>> "beginDate" to "endDate" where Dimension1 = d1 and/or Dimension2 = d2 ...
>> Dimensionk = d_k
>>
>> As I said, each impression can have hundreds of dimensions (listed above),
>> and the cardinality of each dimension could range from a few hundred (say
>> for dimension Country) to billions (e.g. User-id).
>>
>> We want approximate answers, the least infrastructure cost, and query
>> response time within < 5 minutes.
>> I am thinking about using Druid and Apache
>> datasketches <https://datasketches.apache.org/> (the Theta Sketch to be
>> precise) for answering Q2, using the following data model:
>>
>> Date        Dimension Name  Dimension Value  Unique-User-ID (Theta sketch)
>> 2021/09/12  "Country"       "US"             37873-3udif-83748-2973483
>> 2021/09/12  "Browser"       "Chrome"         37873-3aeuf-83748-2973483
>> .
>> . <Other records>
>>
>> So after roll-up, I would end up with 1 theta-sketch per dimension value
>> per day (assuming day-level granularity), and I can do unions and
>> intersections on these sketches to answer Q2.
>>
>> I am planning to set k (nominal entries) to 10^5 (please comment on what
>> would be a suitable k for this use case and the expected amount of storage
>> required).
>>
>> I’ve also read about Theta sketch set-ops accuracy here:
>> <https://datasketches.apache.org/docs/Theta/ThetaSketchSetOpsAccuracy.html>
>>
>> I would like to know if there is a better approach to solve Q2 (with or
>> without Druid).
>>
>> Also, I would like to know how I can solve Q1.
>>
>> If I replace Unique-User-Id with “Impression-Id”, can I use the same data
>> model to answer Q1? I believe that if I replace Unique-User-Id with
>> “Impression-Id” then the accuracy of counting total impressions would be
>> way worse than for Q2, because each ad-impression is assigned a unique id
>> and we are currently serving 250 Billion per day.
>>
>> Please share your thoughts about solving Q1 and Q2.
>>
>> Regards
>> kartik
>