Even though filter pushdown is supported in Drill, it is limited to numeric values, including dates. We do not support pushdown of varchar values because of this bug in the Parquet library:
https://issues.apache.org/jira/browse/PARQUET-686

The correctness issue with comparisons is what makes depending on the Parquet library's min/max statistics unreliable.

________________________________
From: Stefán Baxter <[email protected]>
Sent: Monday, May 29, 2017 1:41:30 PM
To: user
Subject: Parquet filter pushdown and string fields that use dictionary encoding

Hi,

I would like to verify that my understanding of Parquet filter pushdown in Drill (https://drill.apache.org/docs/parquet-filter-pushdown/) is correct.

Is it correctly understood that Drill does not support predicate pushdown for string fields when dictionary-based string encoding is enabled? (It looks like Presto can do this.)

We save a lot of space using dictionary encoding (not enabled by default in Drill 1.10), and if my understanding of how it works is correct, the segment dictionary could be used to determine whether a value is present in a segment, so the segment could be pruned/skipped when filtering on columns that are compressed/encoded using a dictionary.

I may be misunderstanding how this works; perhaps the dictionary is created for the file as a whole rather than for individual sections. I do know, however, that min/max values would not be a good way to determine whether a segment scan is needed.

I was hoping we could partition on field(s) with lower cardinality to get typical partition pruning, and then sort the contents of individual files by session/customer IDs (which include alphanumeric characters here) so that each segment would contain a relatively low number of those unique values, facilitating "segment pruning" when looking for data belonging to individual sessions/customers.

Best regards,
-Stefán Baxter
