Hi All, I am pretty new to the community and I am trying to wrap my head around using the theta sketch Python library to compute approximate distinct counts.
Here is my use case:
- I have the following table structure: visit_id, dimension (array), date (a single GMT day, e.g. 1/1/2022)
- I want to run a distinct count of visit_ids over a dynamic date range and group them by dimension sets, e.g. select count(distinct visit_id) where date >= a and date <= b and dimension contains x or dimension contains y and dimension contains z

What I am planning is:
- Create a theta sketch cube for each date, using a workflow orchestration tool like Airflow, and store the cubes in a hash table, e.g. DynamoDB
- Retrieve the theta sketch cubes for each day in the date range and do unions and intersections on request

Here are my questions:
- I was trying to look at this example: https://github.com/apache/datasketches-cpp/blob/763f9249de576dca8c080fb4f3f438625a332b0b/python/tests/theta_test.py#L20
- To create the sketches, should I first calculate the distinct count grouped by date and dimension, and use that value with a key that is some combination of the dimension and date?
- Would the blob I store in the hash table, under the key that I construct, be the result returned by the generate_theta_sketch method in the example test?
- If this is the case, in order to query a date range, would I first have to construct a union of the same dimension across the different dates in the range before I can do any unions/intersections of different dimension values in that date range? Is there an easier way?

--
Kevin Peng
Chief Engineer, DMP
305.775.2463
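
To make the plan above concrete, here is a rough sketch of the layout and query flow I have in mind. It is only an illustration under stated assumptions: plain Python sets stand in for theta sketches (they support the same union and intersection operations, but exactly rather than approximately), an in-memory dict stands in for the hash table, and the sample rows, the (date, dimension) key shape, and every function name are hypothetical. It also assumes the query is read as (x or y) and z, which the query as written does not pin down.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical sample rows: (visit_id, dimension array, GMT day).
ROWS = [
    ("v1", ["x", "z"], date(2022, 1, 1)),
    ("v2", ["y"], date(2022, 1, 1)),
    ("v1", ["x", "z"], date(2022, 1, 2)),
    ("v3", ["x"], date(2022, 1, 2)),
    ("v4", ["y", "z"], date(2022, 1, 3)),
]


def build_store(rows):
    """Build one entry per (day, dimension value) from the raw visit_ids.

    A set stands in for a theta sketch here; the dict stands in for the
    hash table. Both are assumptions made for illustration only.
    """
    store = defaultdict(set)
    for visit_id, dimensions, day in rows:
        for dim in dimensions:
            store[(day, dim)].add(visit_id)
    return store


def days_between(start, end):
    """Yield every day from start to end, inclusive."""
    day = start
    while day <= end:
        yield day
        day += timedelta(days=1)


def dimension_over_range(store, dim, start, end):
    """Union the per-day entries of one dimension value over a date range."""
    merged = set()
    for day in days_between(start, end):
        merged |= store.get((day, dim), set())
    return merged


def query_x_or_y_and_z(store, start, end):
    """Evaluate (x OR y) AND z over the range (assumed reading of the query)."""
    x = dimension_over_range(store, "x", start, end)
    y = dimension_over_range(store, "y", start, end)
    z = dimension_over_range(store, "z", start, end)
    return (x | y) & z


if __name__ == "__main__":
    store = build_store(ROWS)
    result = query_x_or_y_and_z(store, date(2022, 1, 1), date(2022, 1, 3))
    print(len(result))
```

In the real pipeline each set would presumably be replaced by a sketch stored in serialized form, and the set operators by the library's union and intersection objects, with the final count read as an estimate rather than an exact length.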