Ad impression counting and unique users counting using data sketches

Kartik Mahajan Tue, 14 Sep 2021 22:09:41 -0700

Hi,

We’re currently serving 100 Billion ad impressions per day across our 6
data centers. Out of these, we are serving about 80 Billion ad impressions
in the US alone.
Each ad impression can have hundreds of attributes(dimensions) e.g Country,
City, Brower, OS, Custom-Parameters from web-page, ad-size, ad-id, site-id
etc


Currently, we don’t have a data warehouse and ad-hoc OLAP support is pretty
much non-existent in our organization. This severely limits our ability to
run adhoc queries and get a quick grasp of data.

We want to answer the following 2 queries to begin with :-

Q1) Find the total count of ad impressions which were served from
"beginDate" to "endDate" where Dimension1 = d1 and Dimension2 = d2
.... .. Dimensionk = d_k

Q2) Find the total count of unique users which saw our ads from
"beginDate" to "endDate" where Dimension1 = d1 and/or Dimension2 = d2
.... .. Dimensionk = d_k

As I said each impression can have hundreds of dimensions(listed above) and
cardinality of each dimension could be from few hundreds(say for dimension
Country) to Billions(for e.g User-id).

We want approximate answers and the least infrastructure cost and query
response time within < 5 minutes. I am thinking about using Druid and Apache
datasketches <https://datasketches.apache.org/>(Theta Sketch to be precise)
for answering Q2 and using the following data-model :-

Date Dimension Name     Dimension Value        Unique-User-ID(Theta sketch)
2021/09/12   "Country"        "US"             37873-3udif-83748-2973483
2021/09/12   "Browser"       "Chrome"          37873-3aeuf-83748-2973483
.
.<Other records>

So after roll-up, I would end up with 1 theta-sketch per dimension value
per day(assuming day level granularity) and I can do unions and
intersections on these sketches to answer Q2)

I am planning to set k(nominal entries) to 10^5(please comment about what
would be suitable k for this use case and expected storage amount required?)

I’ve also read that the about theta sketch set ops accuracy here
<https://datasketches.apache.org/docs/Theta/ThetaSketchSetOpsAccuracy.html>

I would like to know if there is a better approach to solve Q2(with or
without Druid)

Also I would like to know how can I solve Q1?

If I replace Unique-User-Id with “Impression-Id”, can I use the same data
model to answer Q1? I believe that if I replace Unique-User-Id with
“Impression-Id” then accuracy to count the total impressions would be way
worse than that of Q2, because each ad-impression is assigned a unique id
and we are currently serving 250 Billion per day.

Please share your thoughts about solving Q1 and Q2.

Regards
kartik

Ad impression counting and unique users counting using data sketches

Reply via email to