Apache Impala integration with DataSketches HLL (C++)

Gabor Kaszab Mon, 27 Apr 2020 06:19:42 -0700

Hey,

I'm a contributor to Apache Impala (a fast, distributed SQL query engine
for big data) and recently started working on pulling in HLL sketching from
DataSketches. I managed to put together a PoC in which Impala runs a
count(distinct) estimate on a table column and, in the background, uses
DataSketches' HLL algorithm from apache/incubator-datasketches-cpp to
produce the result.


My first question: given that the order in which items are fed to
datasketches::hll_sketch is not deterministic, is it normal behaviour to
get a different estimate for the same dataset each time I run my query?
I'm trying to figure out whether this is caused by an issue in my code or
is a normal characteristic of the DataSketches C++ library.

My second question: if Hive uses the Hive connectors from DataSketches and
Impala uses the provided C++ library, is it guaranteed that a sketch
written by either system can be correctly read by the other? I see binary
compatibility mentioned on the official web page, but I wanted to double
check whether there are any exceptions to this.

Cheers,
Gabor
