Is there a way to do that during the creation of the parquet table? It might
be a Hive question, but all we do is 'STORED AS parquet' and then set the
parquet.* properties during the insert. I'm just trying to see if #2 is an
option for us to utilize filter pushdown via dfs.
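For reference, the workflow described above might look like the following sketch (table, column, and staging names are hypothetical; the parquet.* settings shown are examples of the insert-time properties mentioned):

```sql
-- Hypothetical Hive DDL matching the workflow described above:
-- a plain 'STORED AS parquet' table, with parquet.* properties
-- set at insert time rather than in the DDL.
CREATE TABLE my_table (
  my_column    INT,
  other_column STRING
)
STORED AS PARQUET;

-- Example parquet.* properties applied in the session before the insert.
SET parquet.compression=SNAPPY;
SET parquet.block.size=134217728;

INSERT INTO TABLE my_table
SELECT my_column, other_column
FROM staging_table;
```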
On Mon, Nov 14, 2016 at 3:43 PM, rahul challapalli <[email protected]> wrote:

> I do not know of any plans to support filter pushdown when using the hive
> plugin. If you run analyze stats, then hive computes the table stats and
> stores them in the hive metastore for the relevant table. I believe drill
> uses some of these stats. However, running the analyze stats command does
> not alter (or add) the metadata in the parquet files themselves. The
> parquet-level metadata should be written when the parquet file itself is
> created in the first place.
>
> - Rahul
>
> On Mon, Nov 14, 2016 at 3:32 PM, Sonny Heer <[email protected]> wrote:
>
> > Rahul,
> >
> > Thanks for the details. Are there any plans to support filter pushdown
> > for #1? Do you know whether, if we run analyze stats through hive on a
> > parquet file, that will have enough info to do the pushdown?
> >
> > Thanks again.
> >
> > On Mon, Nov 14, 2016 at 9:50 AM, rahul challapalli
> > <[email protected]> wrote:
> >
> > > Sonny,
> > >
> > > If the underlying data in the hive table is in parquet format, there
> > > are 3 ways to query it from drill:
> > >
> > > 1. Using the hive plugin: This does not support filter pushdown for
> > > any format (ORC, Parquet, Text, etc.).
> > > 2. Directly querying the folder in maprfs/hdfs which contains the
> > > parquet files using the DFS plugin: With DRILL-1950, we can now do a
> > > filter pushdown into the parquet files. In order to take advantage
> > > of this feature, the underlying parquet files should have the
> > > relevant stats. This feature will only be available with the 1.9.0
> > > release.
> > > 3. Using drill's native parquet reader in conjunction with the hive
> > > plugin (see store.hive.optimize_scan_with_native_readers): This
> > > allows drill to fetch all the metadata about the hive table from the
> > > metastore, and then drill uses its own parquet reader for actually
> > > reading the files.
> > > This approach currently does not support parquet filter pushdown,
> > > but this might be added in the next release after 1.9.0.
> > >
> > > - Rahul
> > >
> > > On Sun, Nov 13, 2016 at 11:06 AM, Sonny Heer <[email protected]>
> > > wrote:
> > >
> > > > I'm running a drill query with a where clause on a non-partitioned
> > > > column via the hive storage plugin. This query inspects all
> > > > partitions (kind of expected), but when I run the same query in
> > > > Hive I can see the predicate pushed down in the query plan. This
> > > > particular query is much faster in Hive vs. Drill. BTW, these are
> > > > parquet files.
> > > >
> > > > Hive:
> > > >
> > > > Stage-0
> > > >   Fetch Operator
> > > >     limit:-1
> > > >     Select Operator [SEL_2]
> > > >       outputColumnNames:["_col0"]
> > > >       Filter Operator [FIL_4]
> > > >         predicate:(my_column = 123) (type: boolean)
> > > >         TableScan [TS_0]
> > > >           alias:my_table
> > > >
> > > > Any idea why this is? My guess is Hive is storing hive-specific
> > > > info in the parquet file since it was created through Hive,
> > > > although it seems the drill-hive plugin should honor this too.
> > > > Not sure, but willing to look through code if someone can point me
> > > > in the right direction. Thanks!

--
Pushpinder S. Heer
Senior Software Engineer
m: 360-434-4354 h: 509-884-2574
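A sketch of what approaches #2 and #3 from the thread look like on the Drill side (the warehouse path, table name, and column are hypothetical; the option name and DRILL-1950 behavior are as described above):

```sql
-- Option 2: query the parquet files directly through the dfs plugin.
-- Filter pushdown here comes from DRILL-1950 (available with 1.9.0)
-- and relies on the parquet files carrying the relevant stats.
SELECT *
FROM dfs.`/apps/hive/warehouse/my_table`
WHERE my_column = 123;

-- Option 3: keep using the hive plugin for metadata, but let Drill's
-- native parquet reader perform the scan.
ALTER SESSION SET `store.hive.optimize_scan_with_native_readers` = true;

SELECT *
FROM hive.my_table
WHERE my_column = 123;

-- Note: per the thread, Hive's ANALYZE only writes stats to the
-- metastore; it does not add min/max metadata to the parquet files:
-- ANALYZE TABLE my_table COMPUTE STATISTICS;
```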
