Thank you, Kunal. Can you please explain to me why min/max values would be relevant for dictionary-encoded fields? (I think I may be completely misunderstanding how they work.)
Regards,
-Stefán

On Wed, May 31, 2017 at 5:55 PM, Kunal Khatua <[email protected]> wrote:

> Even though filter pushdown is supported in Drill, it is limited to
> pushing down of numeric values including dates. We do not support pushdown
> of varchar because of this bug in the parquet library:
>
> https://issues.apache.org/jira/browse/PARQUET-686
>
> The issue of correctness for comparison is what makes the dependency on
> min-max statistics by the Parquet library unreliable.
>
> ________________________________
> From: Stefán Baxter <[email protected]>
> Sent: Monday, May 29, 2017 1:41:30 PM
> To: user
> Subject: Parquet filter pushdown and string fields that use dictionary
> encoding
>
> Hi,
>
> I would like to verify that my understanding of parquet filter pushdown in
> Drill (https://drill.apache.org/docs/parquet-filter-pushdown/) is correct.
>
> Is it correctly understood that Drill does not support predicate push-down
> for string fields when dictionary based string encoding is enabled? (It
> looks like Presto can do this.)
>
> We save a lot of space using dictionary encoding (not enabled in Drill 1.10
> by default) and if my understanding of how-it-works is correct then the
> segment dictionary could be used to determine if a value is in a segment
> or if it can be pruned/skipped when filtering based on columns that are
> compressed/encoded using a dictionary.
>
> I may be misunderstanding how this works and perhaps the dictionary is
> created for the file as a whole and not individual sections but I know that
> min/max values would not be good to determine the need for a segment scan.
>
> I was hoping we could use partitioning on field(s) with lower cardinality
> to create partitions for typical partition pruning and then sort the
> contents of individual fields by session/customer IDs (which include
> alphanumeric characters here) so that segments would only contain a
> relatively low number of those unique values to facilitate "segment
> pruning" when looking for data belonging to individual sessions/customers.
>
> Best regards,
> -Stefán Baxter
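
To make the "segment pruning" idea from the quoted message concrete, here is a minimal Python sketch comparing the two checks on an equality filter. It is only an illustration under stated assumptions: `RowGroup`, `may_match_min_max` and `may_match_dictionary` are hypothetical names, not Drill or Parquet APIs, and it assumes each segment keeps its own dictionary, which is exactly the point the quoted message is unsure about.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """Hypothetical stand-in for one segment of a string column."""
    values: list

    @property
    def min_max(self):
        # What min/max statistics would record for this segment.
        return min(self.values), max(self.values)

    @property
    def dictionary(self):
        # Assumption: one dictionary of distinct values per segment.
        return set(self.values)


def may_match_min_max(group, wanted):
    """The segment can be skipped only if wanted lies outside [min, max]."""
    lo, hi = group.min_max
    return lo <= wanted <= hi


def may_match_dictionary(group, wanted):
    """The segment can be skipped whenever wanted is absent from its dictionary."""
    return wanted in group.dictionary


# Made-up customer IDs; segment 0 has a wide range but does not contain "m3".
groups = [
    RowGroup(["a1", "a1", "c7", "z9"]),
    RowGroup(["b2", "b2", "b5"]),
    RowGroup(["d4", "m3", "m3"]),
]

wanted = "m3"
scan_min_max = [i for i, g in enumerate(groups) if may_match_min_max(g, wanted)]
scan_dictionary = [i for i, g in enumerate(groups) if may_match_dictionary(g, wanted)]

print("segments to scan using min/max:   ", scan_min_max)
print("segments to scan using dictionary:", scan_dictionary)
```

With this made-up data the min/max check still has to scan segment 0, because "m3" falls inside its range even though the value is not present, while the dictionary check skips it. Sorting by session/customer ID, as proposed in the quoted message, would narrow each segment's range and make both checks more selective.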
