Hi, I am a Drill user and use Parquet as the storage format. I understand that some new features have been added to the latest Parquet format. The new column indexes feature seems very attractive. Is there any plan to support it in Drill?
Thanks very much!

Feature details: https://github.com/apache/parquet-format/blob/master/CHANGES.md#version-250
See also: https://issues.apache.org/jira/browse/PARQUET-1201

The goals are to make both range scans and point lookups I/O efficient by allowing direct access to pages based on their min and max values. In particular:

1. A single-row lookup in a row group based on the sort column of that row group will only read one data page per retrieved column. Range scans on the sort column will only need to read the exact data pages that contain relevant data.
2. Make other selective scans I/O efficient: if we have a very selective predicate on a non-sorting column, for the other retrieved columns we should only need to access data pages that contain matching rows.
3. No additional decoding effort for scans without selective predicates, e.g., full row group scans. If a reader determines that it does not need to read the index data, it does not incur any overhead.
4. Index pages for sorted columns use minimal storage by storing only the boundary elements between pages.
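
To make goals 1 and 2 above concrete, here is a minimal sketch in plain Python of how I understand page pruning with min/max values. It is not Drill or Parquet library code: `PageStats`, `pages_for_range`, and `row_ranges` are hypothetical names I made up for illustration, and the index values are invented sample data.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PageStats:
    """Hypothetical per-page index entry: min/max values plus the page's row span."""
    min_value: int
    max_value: int
    first_row: int
    row_count: int


def pages_for_range(index: List[PageStats], lo: int, hi: int) -> List[int]:
    """Return positions of pages whose [min, max] interval overlaps [lo, hi]."""
    return [
        i for i, page in enumerate(index)
        if page.max_value >= lo and page.min_value <= hi
    ]


def row_ranges(index: List[PageStats], page_ids: List[int]) -> List[Tuple[int, int]]:
    """Map matching pages to (first_row, end_row_exclusive) spans.

    A reader could use these spans to pick only the pages of the other
    retrieved columns that cover the same rows.
    """
    return [
        (index[i].first_row, index[i].first_row + index[i].row_count)
        for i in page_ids
    ]


# Invented sample index for a sorted column: four pages of 100 rows each.
sort_col_index = [
    PageStats(min_value=0, max_value=99, first_row=0, row_count=100),
    PageStats(min_value=100, max_value=199, first_row=100, row_count=100),
    PageStats(min_value=200, max_value=299, first_row=200, row_count=100),
    PageStats(min_value=300, max_value=399, first_row=300, row_count=100),
]

point = pages_for_range(sort_col_index, 150, 150)  # point lookup on the sort column
scan = pages_for_range(sort_col_index, 180, 220)   # range scan on the sort column
rows = row_ranges(sort_col_index, scan)            # row spans for the other columns
```

With this sample data the point lookup touches a single page and the range scan touches two, instead of reading all four pages of the row group.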