Hi Alex, Lee and Theta sketches team,

Most of the discussion in this forum has been about sketches, Druid, Java, Pig
Latin, Hadoop, Hive and so on. I would like to know whether there is a forum
for research and open source work on applications of theta sketches.

The research page [
https://datasketches.apache.org/docs/Community/Research.html ] lists a paper
on frequent itemset extraction using theta sketches [
https://dl.acm.org/doi/10.1145/2902251.2902278 , PDF:
https://dl.acm.org/doi/pdf/10.1145/2902251.2902278 ]. I found it too math
intensive, but the topic is a huge area of interest to me, and IMO it has the
potential to make analytics really fast (think beyond SQL).

*[LMTU16] Edo Liberty, Michael Mitzenmacher, Justin Thaler, and Jonathan
Ullman. Space lower bounds for itemset frequency sketches. In ACM PODS
Proceedings '16, pages 441–454, 2016.*


I have experimented with a couple of datasets, including one for population
healthcare analysis, and the results have been very promising. Here is a
short talk I gave on the subject:
https://www.youtube.com/watch?v=InnNL15B4cw&t=3136s .
I also have Java code that I wrote as an experiment, but it is not well
documented and has no README :(. Beware: reading it [
https://github.com/vijaysrajan/SketchAnalysisOnHealthcare ] could make your
head spin. I am happy to jump on a call to explain the idea.

Apart from frequent itemsets (Apriori-Tid), I believe that decision trees
such as C5.0 for binary classification on big data, and a few more classical
algorithms, can be implemented using theta sketches.
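
To make the frequent itemset idea concrete, here is a minimal sketch in
Python. It assumes one theta sketch of record IDs per item, so that the
support of an itemset can be estimated as the cardinality of the intersection
of those sketches. This is a toy written from scratch for illustration only:
ToyThetaSketch, intersect, hash64 and the patient IDs are all hypothetical
names and data, and none of this is the DataSketches library API.

```python
import hashlib

MAX_HASH = 2 ** 64  # hash values are treated as integers in [0, 2^64)


def hash64(item):
    """Map an item to a 64-bit integer (first 8 bytes of its SHA-256 digest)."""
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class ToyThetaSketch:
    """Toy theta sketch: retains at most k hash values below a threshold theta."""

    def __init__(self, k=1024):
        self.k = k
        self.theta = MAX_HASH  # theta / MAX_HASH is the sampling fraction
        self.hashes = set()

    def update(self, item):
        h = hash64(item)
        if h < self.theta:
            self.hashes.add(h)
            if len(self.hashes) > self.k:
                # Over capacity: evict the largest hash and lower theta to it.
                largest = max(self.hashes)
                self.hashes.remove(largest)
                self.theta = largest

    def estimate(self):
        # Retained count scaled up by the inverse of the sampling fraction.
        return len(self.hashes) * MAX_HASH / self.theta


def intersect(a, b):
    """Intersection: take the smaller theta, keep hashes present in both."""
    out = ToyThetaSketch(min(a.k, b.k))
    out.theta = min(a.theta, b.theta)
    out.hashes = {h for h in a.hashes & b.hashes if h < out.theta}
    return out


# Hypothetical data: one sketch of patient IDs per attribute (item).
diabetic = ToyThetaSketch()
smoker = ToyThetaSketch()
for i in range(600):
    diabetic.update(f"patient-{i}")
for i in range(400, 1000):
    smoker.update(f"patient-{i}")

# Estimated support of the itemset {diabetic, smoker}.
support = intersect(diabetic, smoker).estimate()
print(support)
```

Because each sketch here holds fewer than k IDs, the estimate is exact; once
more than k distinct IDs are inserted, it becomes an approximation whose
relative error shrinks roughly as 1/sqrt(k). An Apriori-style search would
build candidate itemsets level by level and prune those whose estimated
support falls below a chosen threshold.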

Questions for this group:
1. Would folks be interested in implementing open source libraries?
2. Would folks be interested in running experiments on real datasets,
perhaps including some at Yahoo, to benchmark the results?
3. Would folks be interested in publishing some papers?

Regards
Vijay
