Even though filter pushdown is supported in Drill, it is limited to numeric
types, including dates. We do not support pushdown for varchar columns
because of this bug in the Parquet library:

https://issues.apache.org/jira/browse/PARQUET-686

That comparison-correctness issue is what makes relying on the min/max
statistics produced by the Parquet library unreliable for strings.
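To make the min/max concern concrete, here is a minimal, purely illustrative sketch (not Drill's actual code; the function name and values are hypothetical) of how equality-predicate pruning on row-group statistics works, and why it breaks if the writer computed the statistics under a different byte ordering than the reader assumes:

```python
def can_skip_row_group(min_val, max_val, target):
    """A row group can be skipped for the predicate `col = target` only
    if `target` falls outside [min_val, max_val] -- and that is only
    safe when reader and writer agree on the comparison ordering."""
    return target < min_val or target > max_val

# For numeric types the ordering is unambiguous, so pruning is safe:
print(can_skip_row_group(10, 20, 5))   # row group cannot contain 5
print(can_skip_row_group(10, 20, 15))  # row group may contain 15

# For binary/varchar, PARQUET-686 describes writers that compared
# bytes as signed, so stored min/max for strings containing bytes
# >= 0x80 (e.g. non-ASCII UTF-8) can be wrong. Pruning on such
# statistics could silently drop matching rows, hence Drill
# disables varchar pushdown entirely.
```

The point is that pruning is a correctness decision, not just an optimization, so an unreliable ordering forces the safe default of scanning everything.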


________________________________
From: Stefán Baxter <[email protected]>
Sent: Monday, May 29, 2017 1:41:30 PM
To: user
Subject: Parquet filter pushdown and string fields that use dictionary encoding

Hi,

I would like to verify that my understanding of parquet filter pushdown in
Drill (https://drill.apache.org/docs/parquet-filter-pushdown/) is correct.

Is it correctly understood that Drill does not support predicate push-down
for string fields when dictionary based string encoding is enabled?  (It
looks like Presto can do this.)

We save a lot of space using dictionary encoding (not enabled by default in
Drill 1.10). If my understanding of how it works is correct, the segment
dictionary could be used to determine whether a value is present in a
segment at all, so the segment could be pruned/skipped when filtering on
columns that are compressed/encoded with a dictionary.

I may be misunderstanding how this works (perhaps the dictionary is created
for the file as a whole rather than for individual sections), but I do know
that min/max values would not be a good way to decide whether a segment
needs to be scanned.
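The dictionary-pruning idea described above can be sketched in a few lines. This is hypothetical illustration only (the function and the sample dictionary are made up, and it assumes the column chunk is fully dictionary-encoded with no plain-encoded fallback pages):

```python
def can_skip_via_dictionary(chunk_dictionary, target):
    """If every value in a column chunk is dictionary-encoded, the
    chunk cannot contain `target` unless `target` appears in the
    dictionary -- an exact membership test, unlike a min/max range."""
    return target not in set(chunk_dictionary)

# Hypothetical per-chunk dictionary of session IDs:
dictionary = ["sess-a07", "sess-b19", "sess-c42"]

print(can_skip_via_dictionary(dictionary, "sess-x99"))  # safe to skip
print(can_skip_via_dictionary(dictionary, "sess-b19"))  # must scan
```

Unlike min/max statistics, this test does not depend on any comparison ordering, which is why it would sidestep the string-comparison problem for equality filters.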

I was hoping we could use partitioning on field(s) with lower cardinality to
get typical partition pruning, and then sort the contents of individual
fields by session/customer IDs (which include alphanumeric characters here).
Each segment would then contain a relatively low number of those unique
values, facilitating "segment pruning" when looking for data belonging to
individual sessions/customers.

Best regards,
 -Stefán Baxter
