Hi All,

I am pretty new to the community and I am trying to wrap my head
around using the theta sketch Python library to compute approximate
distinct counts.

Here is my use case:

   - I have the following table structure: visit_id, dimension (array),
   date (a single GMT day, e.g. 1/1/2022)
   - I want to run a distinct count of visit_ids over a dynamic date range
   and group them by dimension sets, e.g. select count(distinct visit_id)
   where date >= a and date <= b and (dimension contains x or dimension
   contains y and dimension contains z)
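
To pin down the semantics I am after, here is the exact (non-sketch)
version of that query on a few made-up rows. The rows are hypothetical,
and reading the predicate as x OR (y AND z) is my assumption. The sketch
based result should approximate this number.

```python
from datetime import date

# Hypothetical rows: (visit_id, dimension values, GMT day)
rows = [
    ("v1", ["x"], date(2022, 1, 1)),
    ("v2", ["y", "z"], date(2022, 1, 1)),
    ("v2", ["y", "z"], date(2022, 1, 2)),
    ("v3", ["y"], date(2022, 1, 2)),
    ("v4", ["x", "z"], date(2022, 1, 5)),
]


def exact_distinct_visits(rows, start, end):
    """Exact distinct visit_ids whose dimensions contain x OR (y AND z)."""
    matched = set()
    for visit_id, dims, day in rows:
        if not (start <= day <= end):
            continue  # outside the requested date range
        if "x" in dims or ("y" in dims and "z" in dims):
            matched.add(visit_id)  # a set keeps each visit_id once
    return len(matched)


print(exact_distinct_visits(rows, date(2022, 1, 1), date(2022, 1, 2)))
```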

What I am planning is:

   - Create a theta sketch cube for each date, using a workflow
   orchestration tool like Airflow, and store the sketches in a hashtable,
   e.g. DynamoDB
   - Retrieve the theta sketch cubes for each day in the date range and do
   unions and intersections on request

Here is my question:

   - I was trying to look at this example:
   https://github.com/apache/datasketches-cpp/blob/763f9249de576dca8c080fb4f3f438625a332b0b/python/tests/theta_test.py#L20
      - For creating the sketches, should I calculate the distinct
      count grouped by date and dimension first and use that value, with the
      key being some combination of the dimension and date?
      - Would the entry I store in the hashtable be the key that I
      construct, with the blob being the result returned by the
      generate_theta_sketch method in the example test?
         - If this is the case, in order to query a date range, would I have
         to first construct a union of the same dimension across the
         different dates within the date range before I can do any
         unions/intersections of different dimension values in that date
         range? Is there an easier way?


-- 
Kevin Peng
Chief Engineer, DMP
305.775.2463
