This seems quite interesting. Drill already does row group pruning, but page-level pruning based on the column indexes would be a big win. Also, as you may know, Drill recently added a feature to leverage secondary indexes in NoSQL databases [1]. However, we would have to see whether that capability applies to the Parquet index, since the Parquet index is local to each file.
Please create a JIRA and add your input to it.

Thanks.

[1] https://issues.apache.org/jira/browse/DRILL-6381

On Wed, Dec 12, 2018 at 10:30 AM Lou kevin <lou.kev...@gmail.com> wrote:

> Hi, I am a Drill user and use Parquet as the storage format.
> I understand that some new features have been added to the latest Parquet
> format. The new column index feature seems very attractive. Is there any
> plan to support it in Drill?
>
> Thanks very much!
>
> The feature details:
> https://github.com/apache/parquet-format/blob/master/CHANGES.md#version-250
> See https://issues.apache.org/jira/browse/PARQUET-1201
>
> And the goals: make both range scans and point lookups I/O efficient by
> allowing direct access to pages based on their min and max values. In
> particular:
> 1. A single-row lookup in a rowgroup based on the sort column of that
> rowgroup will only read one data page per retrieved column. Range scans on
> the sort column will only need to read the exact data pages that contain
> relevant data.
> 2. Make other selective scans I/O efficient: if we have a very selective
> predicate on a non-sorting column, for the other retrieved columns we
> should only need to access data pages that contain matching rows.
> 3. No additional decoding effort for scans without selective predicates,
> e.g., full-row group scans. If a reader determines that it does not need to
> read the index data, it does not incur any overhead.
> 4. Index pages for sorted columns use minimal storage by storing only the
> boundary elements between pages.
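
To make the page-level pruning idea a bit more concrete, here is a rough Python sketch of goals 1 and 2 above. It is only an illustration under stated assumptions: `PageStats`, `pages_for_range`, `row_ranges` and `pages_covering` are hypothetical names made up for this example, not part of Drill or of any Parquet library, and the in-memory list stands in for whatever a real reader would decode from the file's index structures.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PageStats:
    """Hypothetical per-page index entry: value bounds plus the rows the page holds."""
    min_value: int
    max_value: int
    first_row: int
    row_count: int


def pages_for_range(index: List[PageStats], lo: int, hi: int) -> List[int]:
    """Ordinals of pages whose [min, max] overlaps the closed range [lo, hi]."""
    return [i for i, p in enumerate(index)
            if p.max_value >= lo and p.min_value <= hi]


def row_ranges(index: List[PageStats], page_ids: List[int]) -> List[Tuple[int, int]]:
    """Half-open row ranges [start, end) covered by the selected pages."""
    return [(index[i].first_row, index[i].first_row + index[i].row_count)
            for i in page_ids]


def pages_covering(index: List[PageStats], ranges: List[Tuple[int, int]]) -> List[int]:
    """Pages of another column that hold at least one row from the given ranges."""
    return [i for i, p in enumerate(index)
            if any(p.first_row < end and start < p.first_row + p.row_count
                   for start, end in ranges)]


# Assumed layout: column "a" is the sort column with three pages of 100 rows;
# column "b" covers the same 300 rows in two pages of 150 rows.
index_a = [PageStats(0, 99, 0, 100),
           PageStats(100, 199, 100, 100),
           PageStats(200, 299, 200, 100)]
index_b = [PageStats(5, 80, 0, 150),
           PageStats(3, 95, 150, 150)]

point = pages_for_range(index_a, 150, 150)      # point lookup on the sort column
wide = pages_for_range(index_a, 90, 210)        # range scan touching every page
first = pages_for_range(index_a, 10, 20)        # selective predicate on "a"
b_pages = pages_covering(index_b, row_ranges(index_a, first))
print(point, wide, b_pages)
```

The point lookup touches a single page of the sort column, and the selective predicate on "a" is translated into row ranges so that only the overlapping pages of "b" need to be read. A reader with no selective predicate would simply skip these functions, which is the spirit of goal 3.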