That's a lot of good information, Rahul! Thanks.

"modify the query to take advantage of drill's directory based partitioning"
What does this entail? Do you have to tell it which column the directories
are partitioned by? I think option 3 is probably the way to go. Is there a
ticket tracking work on this? Thanks again.

On Tue, Nov 15, 2016 at 10:25 AM, rahul challapalli <[email protected]> wrote:

> Robert's suggestion involves using the DFS plugin. If you use DFS directly
> instead of the hive plugin, then:
>
> 1. The DFS plugin has to determine the underlying data format on the fly.
> 2. The DFS plugin does not know the schema in advance. In the case of
>    parquet, drill gets this information from the parquet metadata. However,
>    if the hive table is backed by a csv file, then you need to cast the
>    columns appropriately in the query or create a view.
> 3. If the underlying hive table is partitioned, drill does not know
>    anything about the partitions. However, since hive partitions are just
>    sub-directories, you can still modify the query to take advantage of
>    drill's directory-based partitioning.
> 4. In terms of performance, I am not aware of any published benchmarks
>    comparing the hive plugin and the dfs plugin for the parquet format,
>    but from my general experience the DFS plugin appears to be faster.
>
> Also do not forget the 3rd option in my first response (Hive plugin +
> Drill native parquet reader). We do have plans to support filter pushdown
> for this scenario in the future.
>
> - Rahul
>
> On Tue, Nov 15, 2016 at 8:01 AM, Sonny Heer <[email protected]> wrote:
>
> > Thanks Robert.
> >
> > "You can then use Drill to query the Hive table and get predicate
> > pushdown"
> >
> > This is using the DFS plugin and going directly to the hive table
> > folder?
> >
> > Can someone speak to what advantages there are to using the hive plugin
> > vs going directly to dfs?
> >
> > On Tue, Nov 15, 2016 at 12:32 AM, Robert Hou <[email protected]> wrote:
> >
> > > I have used Hive 1.2 and I have found that the stats in parquet files
> > > are populated for some data types. Integer, bigint, float, double,
> > > and date work.
> > > String does not seem to work.
> > >
> > > You can then use Drill to query the Hive table and get predicate
> > > pushdown for simple compare filters. These have the form
> > > "where col = value". Other standard operators are !=, <, <=, >, and
> > > >=. Compound filters can use "and/or" logic. This will be supported
> > > in Drill 1.9.
> > >
> > > In the future, we will add expressions and functions.
> > >
> > > Thanks.
> > >
> > > --Robert
> > >
> > > On Mon, Nov 14, 2016 at 3:53 PM, Sonny Heer <[email protected]> wrote:
> > >
> > > > Is there a way to do that during the creation of the parquet table?
> > > > This might be a hive question, but all we do is 'STORED AS parquet'
> > > > and then set the parquet.* properties during insert. I'm just
> > > > trying to see if #2 is an option for us to utilize filter pushdown
> > > > via dfs.
> > > >
> > > > On Mon, Nov 14, 2016 at 3:43 PM, rahul challapalli <[email protected]> wrote:
> > > >
> > > > > I do not know of any plans to support filter pushdown when using
> > > > > the hive plugin.
> > > > > If you run analyze stats, hive computes the table stats and
> > > > > stores them in the hive metastore for the relevant table. I
> > > > > believe drill uses some of these stats. However, running the
> > > > > analyze stats command does not alter (or add to) the metadata in
> > > > > the parquet files themselves. The parquet-level metadata should
> > > > > be written when the parquet file itself is created in the first
> > > > > place.
> > > > >
> > > > > - Rahul
> > > > >
> > > > > On Mon, Nov 14, 2016 at 3:32 PM, Sonny Heer <[email protected]> wrote:
> > > > >
> > > > > > Rahul,
> > > > > >
> > > > > > Thanks for the details. Are there any plans to support filter
> > > > > > pushdown for #1? Do you know if running analyze stats through
> > > > > > hive on a parquet file would produce enough info to do the
> > > > > > pushdown?
> > > > > > Thanks again.
> > > > > >
> > > > > > On Mon, Nov 14, 2016 at 9:50 AM, rahul challapalli <[email protected]> wrote:
> > > > > >
> > > > > > > Sonny,
> > > > > > >
> > > > > > > If the underlying data in the hive table is in parquet
> > > > > > > format, there are 3 ways to query it from drill:
> > > > > > >
> > > > > > > 1. Using the hive plugin: This does not support filter
> > > > > > > pushdown for any format (ORC, Parquet, Text, etc.).
> > > > > > > 2. Directly querying the folder in maprfs/hdfs which contains
> > > > > > > the parquet files using the DFS plugin: With DRILL-1950, we
> > > > > > > can now do a filter pushdown into the parquet files. To take
> > > > > > > advantage of this feature, the underlying parquet files
> > > > > > > should have the relevant stats. This feature will only be
> > > > > > > available with the 1.9.0 release.
> > > > > > > 3. Using drill's native parquet reader in conjunction with
> > > > > > > the hive plugin (see
> > > > > > > store.hive.optimize_scan_with_native_readers): This allows
> > > > > > > drill to fetch all the metadata about the hive table from the
> > > > > > > metastore; drill then uses its own parquet reader for
> > > > > > > actually reading the files. This approach currently does not
> > > > > > > support parquet filter pushdown, but that might be added in
> > > > > > > the next release after 1.9.0.
> > > > > > >
> > > > > > > - Rahul
> > > > > > >
> > > > > > > On Sun, Nov 13, 2016 at 11:06 AM, Sonny Heer <[email protected]> wrote:
> > > > > > >
> > > > > > > > I'm running a drill query with a where clause on a
> > > > > > > > non-partitioned column via the hive storage plugin.
> > > > > > > > This query inspects all partitions (kind of expected), but
> > > > > > > > when I run the same query in Hive I can see a predicate
> > > > > > > > passed down to the query plan. This particular query is
> > > > > > > > much faster in Hive vs Drill. BTW, these are parquet files.
> > > > > > > >
> > > > > > > > Hive:
> > > > > > > >
> > > > > > > > Stage-0
> > > > > > > >   Fetch Operator
> > > > > > > >     limit:-1
> > > > > > > >     Select Operator [SEL_2]
> > > > > > > >       outputColumnNames:["_col0"]
> > > > > > > >       Filter Operator [FIL_4]
> > > > > > > >         predicate:(my_column = 123) (type: boolean)
> > > > > > > >         TableScan [TS_0]
> > > > > > > >           alias:my_table
> > > > > > > >
> > > > > > > > Any idea why this is? My guess is that Hive is storing
> > > > > > > > hive-specific info in the parquet file since it was created
> > > > > > > > through Hive, although it seems the drill-hive plugin
> > > > > > > > should honor this too. Not sure, but I am willing to look
> > > > > > > > through the code if someone can point me in the right
> > > > > > > > direction. Thanks!

--
Pushpinder S. Heer
Senior Software Engineer
m: 360-434-4354 h: 509-884-2574
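To make Rahul's point 3 (directory-based partitioning) concrete: since hive partitions are just sub-directories, Drill exposes each directory level below the queried path as an implicit column (dir0, dir1, ...). A minimal sketch against the DFS plugin, assuming a hypothetical warehouse path and a hive partition column named event_date:

```sql
-- Hypothetical layout created by Hive:
--   /user/hive/warehouse/my_table/event_date=2016-11-01/000000_0.parquet
-- dir0 refers to the first directory level below the queried path.
SELECT my_column
FROM dfs.`/user/hive/warehouse/my_table`
WHERE dir0 = 'event_date=2016-11-01';
```

So you don't tell Drill which column the directories are partitioned on; instead you filter on the directory value itself, and Drill prunes the sub-directories that don't match.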
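For Rahul's point 2 (a hive table backed by csv): the DFS plugin sees only text, and Drill surfaces text-file fields as the `columns` array, so the casts can live in a view. A sketch, with the path, view name, and column names all hypothetical:

```sql
-- Cast the text columns once in a view so queries get typed columns.
-- dfs.tmp is the default writable workspace in a stock Drill install.
CREATE VIEW dfs.tmp.my_table_view AS
SELECT CAST(columns[0] AS INT)     AS my_column,
       CAST(columns[1] AS VARCHAR) AS my_name
FROM dfs.`/user/hive/warehouse/my_csv_table`;
```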
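And for "option 3" (hive plugin + Drill's native parquet reader), the switch Rahul names is a session/system option. A sketch, with the table name hypothetical:

```sql
-- Keep the hive plugin for metastore metadata, but read the parquet
-- files with Drill's own reader.
ALTER SESSION SET `store.hive.optimize_scan_with_native_readers` = true;

SELECT my_column FROM hive.my_table WHERE my_column = 123;
```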
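Finally, to check whether a filter actually reached the parquet scan (relevant for option 2 on Drill 1.9+), the physical plan can be inspected the same way the Hive plan was above; if pushdown happened, the predicate should appear inside the Scan operator rather than only in a separate Filter. Path and column are hypothetical:

```sql
EXPLAIN PLAN FOR
SELECT my_column
FROM dfs.`/user/hive/warehouse/my_table`
WHERE my_column = 123;
```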
