Hi! This came up while trying to ensure HLL sketch interoperability between Apache Hive and Apache Impala.
Currently in Hive the following types are not supported by ds_hll_sketch():
- BOOLEAN
- SMALLINT
- DECIMAL
- TIMESTAMP
- DATE

These types vary in complexity and usefulness: BOOLEAN and SMALLINT seem straightforward, while DECIMAL, DATE and TIMESTAMP are often represented in several different ways, so choosing which byte sequence to hash is not self-evident. It is likely that different projects will do this differently, as hashing the native representation is the easiest and fastest.

Did these questions already come up in other projects, e.g. how to hash a DATE type in an HLL sketch? If it is a goal to support these in an interoperable way (e.g. a sketch created by Hive can be used for estimation by Impala), then it would be useful to come up with some recommendations on what exactly to hash.

Some examples to highlight the possible problems:

DATE:
- int32 days since Unix epoch (proleptic Gregorian)
- string in YYYYMMDD format

TIMESTAMP (nanosecond precision):
- int128 nanoseconds since Unix epoch (UTC, proleptic Gregorian)
- string in YYYYMMDD HHmmss.sssssssss format

DECIMAL(precision, scale):
- minimum number of bytes needed to represent range (two's complement)
- minimum power of 2 bytes needed to represent range (two's complement)

Regards,
Csaba
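
P.S. To make the DATE and DECIMAL examples concrete, here is a minimal Python sketch of the candidate byte sequences. It is only an illustration under stated assumptions: the helper names are hypothetical, the byte orders (little-endian for the int32, big-endian for the decimal) are arbitrary choices that would themselves need to be agreed on, and SHA-256 is a stand-in for whatever hash the sketch library actually applies. The point is only that the same logical value produces different bytes, and therefore different hashes, depending on the representation chosen.

```python
import datetime
import hashlib
import struct


def date_as_int32_days(d):
    # DATE as int32 days since 1970-01-01 (proleptic Gregorian).
    # Assumption: little-endian byte order.
    days = (d - datetime.date(1970, 1, 1)).days
    return struct.pack("<i", days)


def date_as_yyyymmdd(d):
    # DATE as an ASCII string in YYYYMMDD format.
    return f"{d.year:04d}{d.month:02d}{d.day:02d}".encode("ascii")


def min_bytes_for_precision(precision):
    # Minimum number of bytes that can hold any unscaled value of
    # DECIMAL(precision, *) in two's complement (one extra bit for the sign).
    bits = (10 ** precision - 1).bit_length() + 1
    return (bits + 7) // 8


def next_power_of_two(n):
    # Smallest power of two that is >= n.
    p = 1
    while p < n:
        p *= 2
    return p


def decimal_bytes(unscaled, width):
    # Unscaled DECIMAL value as two's complement.
    # Assumption: big-endian byte order.
    return unscaled.to_bytes(width, "big", signed=True)


def stand_in_hash(data):
    # Assumption: SHA-256 stands in for the sketch library's real hash.
    return hashlib.sha256(data).hexdigest()[:16]


d = datetime.date(2020, 1, 1)
date_int = date_as_int32_days(d)   # 4 bytes
date_str = date_as_yyyymmdd(d)     # 8 bytes

# 123.45 stored as DECIMAL(5, 2) has the unscaled value 12345.
min_width = min_bytes_for_precision(5)
pow2_width = next_power_of_two(min_width)
dec_min = decimal_bytes(12345, min_width)
dec_pow2 = decimal_bytes(12345, pow2_width)

for label, data in [
    ("DATE int32 days", date_int),
    ("DATE YYYYMMDD", date_str),
    ("DECIMAL min bytes", dec_min),
    ("DECIMAL pow2 bytes", dec_pow2),
]:
    print(f"{label:20s} {data.hex():18s} {stand_in_hash(data)}")
```

With a real sketch, feeding one of these byte sequences in Hive and another in Impala would make the same value count as two distinct items after a merge, which is why a shared recommendation seems necessary.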