Parquet filter pushdown and string fields that use dictionary encoding

Stefán Baxter Mon, 29 May 2017 13:42:06 -0700

Hi,

I would like to verify that my understanding of parquet filter pushdown in
Drill (https://drill.apache.org/docs/parquet-filter-pushdown/) is correct.


Is it correct that Drill does not support predicate pushdown for string
fields when dictionary-based string encoding is enabled? (It looks like
Presto can do this.)

We save a lot of space using dictionary encoding (not enabled by default in
Drill 1.10). If my understanding of how it works is correct, the segment
dictionary could be used to determine whether a value is present in a
segment, or whether the segment can be pruned/skipped when filtering on
columns that are compressed/encoded using a dictionary.
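
To make the idea concrete, here is a minimal sketch of dictionary-based
segment pruning. It is a toy model only: each segment is represented by the
set of distinct values in its dictionary, the function name and sample IDs
are made up for illustration, and nothing here is a Drill or Parquet API. It
also assumes every page of the column in a segment is dictionary-encoded, so
the dictionary lists all values that occur there.

```python
# Toy model: one set of distinct string values per segment (row group).
# All names and values are hypothetical, not Drill or Parquet APIs.

def row_groups_to_scan(row_group_dictionaries, wanted):
    """Return the indexes of segments whose dictionary contains `wanted`."""
    return [
        index
        for index, dictionary in enumerate(row_group_dictionaries)
        if wanted in dictionary
    ]


dictionaries = [
    {"a1f", "b2c", "c9d"},  # segment 0
    {"d4e", "e7a"},         # segment 1
    {"a1f", "z0q"},         # segment 2
]

# Only segments 0 and 2 list "a1f", so segment 1 could be skipped.
to_scan = row_groups_to_scan(dictionaries, "a1f")
```

A filter such as `session_id = 'a1f'` would then only need to read the
segments returned by the lookup.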

I may be misunderstanding how this works, and perhaps the dictionary is
created for the file as a whole rather than for individual sections, but I
know that min/max values would not be a good way to determine whether a
segment needs to be scanned.
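
The difference between the two checks can be sketched as follows. This is an
illustration with made-up IDs and hypothetical helper names, not how Drill
evaluates statistics: a min/max range says only that a value might be
present, while a dictionary lookup can rule it out.

```python
# Hypothetical helpers comparing the two pruning checks on one segment.

def may_contain_by_min_max(values, wanted):
    """Range check: true whenever `wanted` falls between min and max."""
    return min(values) <= wanted <= max(values)


def may_contain_by_dictionary(values, wanted):
    """Exact check against the set of distinct values in the segment."""
    return wanted in set(values)


segment = ["a1f", "m3k", "z0q"]  # unsorted, widely spread IDs
wanted = "k5t"                   # not present in the segment

by_min_max = may_contain_by_min_max(segment, wanted)        # cannot prune
by_dictionary = may_contain_by_dictionary(segment, wanted)  # can prune
```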

I was hoping we could use partitioning on field(s) with lower cardinality
to create partitions for typical partition pruning and then sort the
contents of individual fields by session/customer IDs (which include
alphanumeric characters here) so that segments would only contain a
relatively low number of those unique values to facilitate "segment
pruning" when looking for data belonging to individual sessions/customers.
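
A small sketch of why sorting should help, under simplifying assumptions:
rows are reduced to their session ID, segments hold a fixed number of rows,
and the helper names are hypothetical. With interleaved IDs every segment
contains every ID; after sorting, each ID is confined to few segments.

```python
# Toy illustration of sorting before writing fixed-size segments.
# Helper names and sizes are assumptions, not Drill or Parquet settings.

def chunk(rows, size):
    """Split rows into consecutive segments of at most `size` rows."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def segments_containing(segments, wanted):
    """Count the segments that would have to be scanned for `wanted`."""
    return sum(1 for segment in segments if wanted in segment)


session_ids = ["s1", "s2", "s3", "s4"]
unsorted_rows = session_ids * 4       # s1, s2, s3, s4, s1, s2, ...
sorted_rows = sorted(unsorted_rows)   # s1 x4, s2 x4, s3 x4, s4 x4

unsorted_hits = segments_containing(chunk(unsorted_rows, 4), "s3")
sorted_hits = segments_containing(chunk(sorted_rows, 4), "s3")
```

In this toy data, a lookup for one session touches all four segments when
the rows are unsorted and a single segment when they are sorted.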

Best regards,
 -Stefán Baxter
