Hi Bruno, Parquet-mr currently doesn't support the new stats fields: https://issues.apache.org/jira/browse/PARQUET-1025
For development purposes I used a small Python script a while ago. I pushed it here so you can check it out if you'd like, but it's really only meant for development purposes: https://github.com/lekv/impala/tree/dump_parquet_stats Cheers, Lars On Sat, Oct 28, 2017 at 1:19 PM, Bruno Quinart <[email protected]> wrote: > Hi > > Thanks for your replies. > > An example query: > SELECT * FROM reasonably_sized_partitioned_table > WHERE partition_key_dtnr = 20171028 > AND sorted_column = 56789123456 > (same effect with an IN instead of the last equality) > > I checked the behavior in our cluster again and can now confirm Lars > explanation: > - Filtering on stats works well as from 2.9, except when there is only > null values for the column in a parquet file > - And that would be addressed by IMPALA-6113 > I was wrongly assuming that in some of my tables there would not have been > enough null values to fill a parquet file for each partition. > > I wonder how you verify those stats in Parquet metadata. Parquet-tools > currently does not print them. > > Thanks for the help > Bruno > > On 27 Oct, 2017, at 12:54 AM, Lars Volker wrote: > > Hi Bruno, > > To clarify on your original observation: Column chunks that only have null > values in them will not have their min_value and max_value fields populated > and thus won't be skipped based on stats. I filed IMPALA-6113 > <https://issues.apache.org/jira/browse/IMPALA-6113> to track this. > IMPALA-5061 <https://issues.apache.org/jira/browse/IMPALA-5061> added > support to populate the null_count in statistics, allowing us to detect > column chunks that only contain NULLs. We should use that information to > skip row groups if the predicate allows us to. > > Row groups with column chunks that have at least one non-null value should > get filtered correctly. > > Cheers, Lars > > On Thu, Oct 26, 2017 at 10:02 AM, Tim Armstrong wrote: > > Hi Bruno, > > Could you provide an example of the specific predicates that aren't being > > used to successfully skip the row group? > > > - Tim > > > On Thu, Oct 26, 2017 at 7:21 AM, Jeszy wrote: > > > Hello Bruno, > > > Thanks for bringing this up. While not apparent from the commit > > comments, this limitation was mentioned during the code review: > > 'min/max are only set when there are non-null values, so we don't > > consider statistics for "is null".' (see > > https://gerrit.cloudera.org/#/c/6147/). > > It looks to me that this was intended, but I'll let others confirm. > > Definitely a point where we can improve. > > > Thanks! > > > On 26 October 2017 at 08:02, Bruno Quinart wrote: > > > Hi all > > > > > > With IMPALA-2328, Parquet row group statistics are now being used to > > skip > > > the row group completely if the min/max range is excluded from the > > > predicate. > > > We have a use case in which we make sure the data is sorted on a 'key' > > and > > > have then many selective queries on that 'key' field. We notice a > > > significant performance increase. > > > So thanks a lot for all the work on that! > > > > > > One thing we notice is an unexpected behavior for records where that > > 'key' > > > has null values. It seems that as soon as null values are present in a > > row > > > group, the test on the min/max fails and the row group is read. > > > > > > We work with Impala 2.9. The data is put in parquet files by Impala > > itself. > > > We have noticed this effect for both bigint as decimal fields. Note that > > > it's difficult for me to extract the min/max statistics from the parquet > > > files. The parquet-tools included in our distribution (5.12) is not the > > > latest. And I was told PARQUET-327 would anyway not print the those row > > > group stats because of the way Impala stores them. > > > We do confirm the expected behavior (exactly one row group read for > > properly > > > sorted data) when we create a similar table but explicitly filter out > > all > > > null values for that 'key' field. We also notice that the the number of > > row > > > groups read (but zero records retained) is proportional to the number of > > > null values. > > > > > > Is this behavior expected? > > > Is there a fundamental reason those row groups can not be skipped? > > > > > > Thanks! > > > Bruno > > > > > > > >
