Hi Alex, Lee, and the Theta Sketches team,

Most of the discussion in this forum has been about sketches, Druid, Java, Pig Latin, Hadoop, Hive, and so on. I would like to know whether there is a forum for research and open source work on applications of Theta Sketches.

The research page for theta sketches [ https://datasketches.apache.org/docs/Community/Research.html ] lists a paper on frequent itemset extraction using theta sketches [ https://dl.acm.org/doi/10.1145/2902251.2902278 ], which I found too math-intensive. This is a huge area of interest to me, and IMO it has the potential to make analytics really fast (think beyond SQL).

[LMTU16] Edo Liberty, Michael Mitzenmacher, Justin Thaler, and Jonathan Ullman. Space lower bounds for itemset frequency sketches. In ACM PODS Proceedings '16, pages 441–454, 2016. [ https://dl.acm.org/doi/pdf/10.1145/2902251.2902278 ]

I have experimented with a couple of datasets, including population healthcare analysis, and the results have been very promising. Here is a short talk I gave on the subject: https://www.youtube.com/watch?v=InnNL15B4cw&t=3136s

I have Java code from these experiments, but it is not well documented and has no README :(. Beware: reading it [ https://github.com/vijaysrajan/SketchAnalysisOnHealthcare ] could make your head spin. I am happy to jump on a call to explain the idea.

Apart from frequent itemsets [Apriori-Tid], I believe that decision trees such as C5.0 for binary classification on big data, and a few more classical algorithms, can be implemented using theta sketches.

Questions to this group:

1. Would folks be interested in implementing open source libraries?
2. Would folks be interested in running experiments on real datasets, perhaps including at Yahoo, to benchmark the results?
3. Would folks be interested in publishing some papers?

Regards,
Vijay