Hi all With IMPALA-2328, Parquet row group statistics are now being used to skip the row group completely if the min/max range is excluded from the predicate. We have a use case in which we make sure the data is sorted on a 'key' and have then many selective queries on that 'key' field. We notice a significant performance increase. So thanks a lot for all the work on that!
One thing we notice is an unexpected behavior for records where that 'key' has null values. It seems that as soon as null values are present in a row group, the test on the min/max fails and the row group is read. We work with Impala 2.9. The data is put in parquet files by Impala itself. We have noticed this effect for both bigint as decimal fields. Note that it's difficult for me to extract the min/max statistics from the parquet files. The parquet-tools included in our distribution (5.12) is not the latest. And I was told PARQUET-327 would anyway not print the those row group stats because of the way Impala stores them. We do confirm the expected behavior (exactly one row group read for properly sorted data) when we create a similar table but explicitly filter out all null values for that 'key' field. We also notice that the the number of row groups read (but zero records retained) is proportional to the number of null values. Is this behavior expected? Is there a fundamental reason those row groups can not be skipped? Thanks! Bruno
