Sonny,

If the underlying data in the Hive table is in Parquet format, there are 3 ways to query it from Drill:
1. Using the Hive plugin: this does not support filter pushdown for any format (ORC, Parquet, text, etc.).

2. Directly querying the folder in MapR-FS/HDFS that contains the Parquet files using the DFS plugin: with DRILL-1950, Drill can now push the filter down into the Parquet files. To take advantage of this, the underlying Parquet files need to have the relevant stats. This feature will only be available with the 1.9.0 release. (A rough example query is at the end of this mail.)

3. Using Drill's native Parquet reader in conjunction with the Hive plugin (see store.hive.optimize_scan_with_native_readers): this lets Drill fetch all the metadata about the Hive table from the metastore and then use its own Parquet reader to actually read the files. This approach currently does not support Parquet filter pushdown, but that might be added in the next release after 1.9.0. (There is an example of this at the end of this mail too.)

- Rahul

On Sun, Nov 13, 2016 at 11:06 AM, Sonny Heer <[email protected]> wrote:

> I'm running a Drill query with a WHERE clause on a non-partitioned column
> via the Hive storage plugin. This query inspects all partitions (kind of
> expected), but when I run the same query in Hive I can see the predicate
> pushed down in the query plan. This particular query is much faster in
> Hive vs Drill. BTW these are Parquet files.
>
> Hive:
>
>   Stage-0
>     Fetch Operator
>       limit:-1
>       Select Operator [SEL_2]
>         outputColumnNames:["_col0"]
>         Filter Operator [FIL_4]
>           predicate:(my_column = 123) (type: boolean)
>           TableScan [TS_0]
>             alias:my_table
>
> Any idea why this is? My guess is Hive is storing Hive-specific info in
> the Parquet file since it was created through Hive, although it seems the
> Drill-Hive plugin should honor this too. Not sure, but willing to look
> through the code if someone can point me in the right direction. Thanks!
>
> --
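For reference, here is roughly what options 2 and 3 look like in practice. The path below is made up (point it at whatever directory actually holds the table's Parquet files), and my_table / my_column are just the names from your plan, so adjust everything to your environment:

    -- Option 2: query the Parquet directory directly through the dfs plugin
    -- (filter pushdown into the scan needs Drill 1.9.0 and Parquet files with stats)
    SELECT my_column
    FROM dfs.`/user/hive/warehouse/my_table`
    WHERE my_column = 123;

    -- Option 3: keep using the Hive plugin, but let Drill read the Parquet
    -- files with its own native reader
    ALTER SESSION SET `store.hive.optimize_scan_with_native_readers` = true;
    SELECT my_column
    FROM hive.my_table
    WHERE my_column = 123;

In either case you can run EXPLAIN PLAN FOR <query> and check whether the filter shows up inside the Parquet scan or only as a separate Filter operator above it.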
