Hey, I'm an Apache Impala (a fast, distributed SQL query engine for big data) contributor and recently started working on integrating HLL sketching from DataSketches. I've put together a PoC in which Impala computes a count(distinct) estimate on a table column, using the DataSketches HLL implementation from apache/incubator-datasketches-cpp in the background.
My first question: given that the order in which items are fed to datasketches::hll_sketch is not deterministic, is it normal that I get a different estimate for the same dataset each time I run my query? I'm trying to figure out whether this is caused by an issue in my code or is a normal characteristic of the DataSketches C++ library.

My second question: if Hive uses the Hive connectors from DataSketches and Impala uses the C++ library, is it guaranteed that a sketch written by either system can be read correctly by the other? I see binary compatibility mentioned on the official web page; I just wanted to double-check whether there are any exceptions.

Cheers,
Gabor