Parquet min/max statistics & null values

Bruno Quinart Thu, 26 Oct 2017 06:02:57 -0700

Hi all

With IMPALA-2328, Parquet row group statistics are now being used to skip the 
row group completely if the min/max range is excluded from the predicate.
We have a use case in which we make sure the data is sorted on a 'key' and have 
then many selective queries on that 'key' field. We notice a significant 
performance increase.
So thanks a lot for all the work on that!


One thing we notice is an unexpected behavior for records where that 'key' has 
null values. It seems that as soon as null values are present in a row group, 
the test on the min/max fails and the row group is read.

We work with Impala 2.9. The data is put in parquet files by Impala itself. We 
have noticed this effect for both bigint as decimal fields. Note that it's 
difficult for me to extract the min/max statistics from the parquet files. The 
parquet-tools included in our distribution (5.12) is not the latest. And I was 
told PARQUET-327 would anyway not print the those row group stats because of 
the way Impala stores them.
We do confirm the expected behavior (exactly one row group read for properly 
sorted data) when we create a similar table but explicitly filter out all null 
values for that 'key' field. We also notice that the the number of row groups 
read (but zero records retained) is proportional to the number of null values.

Is this behavior expected?
Is there a fundamental reason those row groups can not be skipped?

Thanks!
Bruno

Parquet min/max statistics & null values

Reply via email to