Is there a way to do that during the creation of the parquet table? It might
be a Hive question, but all we do is 'STORED AS parquet' and then set the
parquet.* properties during the insert. I'm just trying to see if #2 is an
option for us to utilize filter pushdown via dfs.
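For reference, the workflow described above might look like the following sketch (table, column, and staging names are hypothetical; the parquet.* settings shown are examples of the insert-time properties mentioned):

```sql
-- Hypothetical Hive DDL matching the workflow described above:
-- a plain 'STORED AS parquet' table, with parquet.* properties
-- set at insert time rather than in the DDL.
CREATE TABLE my_table (
  my_column    INT,
  other_column STRING
)
STORED AS PARQUET;

-- Example parquet.* properties applied in the session before the insert.
SET parquet.compression=SNAPPY;
SET parquet.block.size=134217728;

INSERT INTO TABLE my_table
SELECT my_column, other_column
FROM staging_table;
```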
On Mon, Nov 14, 2016 at 3:43 PM, rahul challapalli <[email protected]> wrote:

> I do not know of any plans to support filter pushdown when using the hive
> plugin. If you run analyze stats, then hive computes the table stats and
> stores them in the hive metastore for the relevant table. I believe drill
> uses some of these stats. However, running the analyze stats command does
> not alter (or add) the metadata in the parquet files themselves. The
> parquet-level metadata should be written when the parquet file itself is
> created in the first place.
>
> - Rahul
>
> On Mon, Nov 14, 2016 at 3:32 PM, Sonny Heer <[email protected]> wrote:
>
> > Rahul,
> >
> > Thanks for the details. Are there any plans to support filter pushdown
> > for #1? Do you know whether, if we run analyze stats through hive on a
> > parquet file, that will have enough info to do the pushdown?
> >
> > Thanks again.
> >
> > On Mon, Nov 14, 2016 at 9:50 AM, rahul challapalli
> > <[email protected]> wrote:
> >
> > > Sonny,
> > >
> > > If the underlying data in the hive table is in parquet format, there
> > > are 3 ways to query it from drill:
> > >
> > > 1. Using the hive plugin: This does not support filter pushdown for
> > > any format (ORC, Parquet, Text, etc.).
> > > 2. Directly querying the folder in maprfs/hdfs which contains the
> > > parquet files using the DFS plugin: With DRILL-1950, we can now do a
> > > filter pushdown into the parquet files. In order to take advantage
> > > of this feature, the underlying parquet files should have the
> > > relevant stats. This feature will only be available with the 1.9.0
> > > release.
> > > 3. Using drill's native parquet reader in conjunction with the hive
> > > plugin (see store.hive.optimize_scan_with_native_readers): This
> > > allows drill to fetch all the metadata about the hive table from the
> > > metastore, and then drill uses its own parquet reader for actually
> > > reading the files.
> > > This approach currently does not support parquet filter pushdown,
> > > but this might be added in the next release after 1.9.0.
> > >
> > > - Rahul
> > >
> > > On Sun, Nov 13, 2016 at 11:06 AM, Sonny Heer <[email protected]>
> > > wrote:
> > >
> > > > I'm running a drill query with a where clause on a non-partitioned
> > > > column via the hive storage plugin. This query inspects all
> > > > partitions (kind of expected), but when I run the same query in
> > > > Hive I can see the predicate pushed down in the query plan. This
> > > > particular query is much faster in Hive vs. Drill. BTW, these are
> > > > parquet files.
> > > >
> > > > Hive:
> > > >
> > > > Stage-0
> > > >   Fetch Operator
> > > >     limit:-1
> > > >     Select Operator [SEL_2]
> > > >       outputColumnNames:["_col0"]
> > > >       Filter Operator [FIL_4]
> > > >         predicate:(my_column = 123) (type: boolean)
> > > >         TableScan [TS_0]
> > > >           alias:my_table
> > > >
> > > > Any idea why this is? My guess is Hive is storing hive-specific
> > > > info in the parquet file since it was created through Hive,
> > > > although it seems the drill-hive plugin should honor this too.
> > > > Not sure, but willing to look through code if someone can point me
> > > > in the right direction. Thanks!

--
Pushpinder S. Heer
Senior Software Engineer
m: 360-434-4354 h: 509-884-2574
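A sketch of what approaches #2 and #3 from the thread look like on the Drill side (the warehouse path, table name, and column are hypothetical; the option name and DRILL-1950 behavior are as described above):

```sql
-- Option 2: query the parquet files directly through the dfs plugin.
-- Filter pushdown here comes from DRILL-1950 (available with 1.9.0)
-- and relies on the parquet files carrying the relevant stats.
SELECT *
FROM dfs.`/apps/hive/warehouse/my_table`
WHERE my_column = 123;

-- Option 3: keep using the hive plugin for metadata, but let Drill's
-- native parquet reader perform the scan.
ALTER SESSION SET `store.hive.optimize_scan_with_native_readers` = true;

SELECT *
FROM hive.my_table
WHERE my_column = 123;

-- Note: per the thread, Hive's ANALYZE only writes stats to the
-- metastore; it does not add min/max metadata to the parquet files:
-- ANALYZE TABLE my_table COMPUTE STATISTICS;
```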
