That's a lot of good information, Rahul! Thanks.

"modify the query to take advantage of drill's directory based partitioning"
What does this entail? Do you have to tell it which column the directories
are partitioned by? I think option 3 is probably the way to go. Is there a
ticket tracking work on this? Thanks again.

On Tue, Nov 15, 2016 at 10:25 AM, rahul challapalli <[email protected]> wrote:

> Robert's suggestion involves using the DFS plugin. If you use DFS directly
> instead of the hive plugin, then:
>
> 1. The DFS plugin has to determine the underlying data format on the fly.
> 2. The DFS plugin does not know the schema in advance. In the case of
>    parquet, drill gets this information from the parquet metadata. However,
>    if the hive table is backed by a csv file, then you need to cast the
>    columns appropriately in the query or create a view.
> 3. If the underlying hive table is partitioned, drill does not know
>    anything about the partitions. However, since hive partitions are just
>    sub-directories, you can still modify the query to take advantage of
>    drill's directory-based partitioning.
> 4. In terms of performance, I am not aware of any published benchmarks
>    comparing the hive plugin and the dfs plugin for the parquet format,
>    but from my general experience the DFS plugin appears to be faster.
>
> Also do not forget the 3rd option in my first response (Hive plugin +
> Drill native parquet reader). We do have plans to support filter pushdown
> for this scenario in the future.
>
> - Rahul
>
> On Tue, Nov 15, 2016 at 8:01 AM, Sonny Heer <[email protected]> wrote:
>
> > Thanks Robert.
> >
> > "You can then use Drill to query the Hive table and get predicate
> > pushdown"
> >
> > This is using the DFS plugin and going directly to the hive table
> > folder?
> >
> > Can someone speak to what advantages there are to using the hive plugin
> > vs going directly to dfs?
> >
> > On Tue, Nov 15, 2016 at 12:32 AM, Robert Hou <[email protected]> wrote:
> >
> > > I have used Hive 1.2 and I have found that the stats in parquet files
> > > are populated for some data types. Integer, bigint, float, double,
> > > and date work.
> > > String does not seem to work.
> > >
> > > You can then use Drill to query the Hive table and get predicate
> > > pushdown for simple compare filters. These have the form
> > > "where col = value". Other standard operators are !=, <, <=, >, and
> > > >=. Compound filters can use "and/or" logic. This will be supported
> > > in Drill 1.9.
> > >
> > > In the future, we will add expressions and functions.
> > >
> > > Thanks.
> > >
> > > --Robert
> > >
> > > On Mon, Nov 14, 2016 at 3:53 PM, Sonny Heer <[email protected]> wrote:
> > >
> > > > Is there a way to do that during the creation of the parquet table?
> > > > This might be a hive question, but all we do is 'STORED AS parquet'
> > > > and then set the parquet.* properties during insert. I'm just
> > > > trying to see if #2 is an option for us to utilize filter pushdown
> > > > via dfs.
> > > >
> > > > On Mon, Nov 14, 2016 at 3:43 PM, rahul challapalli <[email protected]> wrote:
> > > >
> > > > > I do not know of any plans to support filter pushdown when using
> > > > > the hive plugin.
> > > > > If you run analyze stats, hive computes the table stats and
> > > > > stores them in the hive metastore for the relevant table. I
> > > > > believe drill uses some of these stats. However, running the
> > > > > analyze stats command does not alter (or add to) the metadata in
> > > > > the parquet files themselves. The parquet-level metadata should
> > > > > be written when the parquet file itself is created in the first
> > > > > place.
> > > > >
> > > > > - Rahul
> > > > >
> > > > > On Mon, Nov 14, 2016 at 3:32 PM, Sonny Heer <[email protected]> wrote:
> > > > >
> > > > > > Rahul,
> > > > > >
> > > > > > Thanks for the details. Are there any plans to support filter
> > > > > > pushdown for #1? Do you know if running analyze stats through
> > > > > > hive on a parquet file would produce enough info to do the
> > > > > > pushdown?
> > > > > > Thanks again.
> > > > > >
> > > > > > On Mon, Nov 14, 2016 at 9:50 AM, rahul challapalli <[email protected]> wrote:
> > > > > >
> > > > > > > Sonny,
> > > > > > >
> > > > > > > If the underlying data in the hive table is in parquet
> > > > > > > format, there are 3 ways to query it from drill:
> > > > > > >
> > > > > > > 1. Using the hive plugin: This does not support filter
> > > > > > > pushdown for any format (ORC, Parquet, Text, etc.).
> > > > > > > 2. Directly querying the folder in maprfs/hdfs which contains
> > > > > > > the parquet files using the DFS plugin: With DRILL-1950, we
> > > > > > > can now do a filter pushdown into the parquet files. To take
> > > > > > > advantage of this feature, the underlying parquet files
> > > > > > > should have the relevant stats. This feature will only be
> > > > > > > available with the 1.9.0 release.
> > > > > > > 3. Using drill's native parquet reader in conjunction with
> > > > > > > the hive plugin (see
> > > > > > > store.hive.optimize_scan_with_native_readers): This allows
> > > > > > > drill to fetch all the metadata about the hive table from the
> > > > > > > metastore; drill then uses its own parquet reader for
> > > > > > > actually reading the files. This approach currently does not
> > > > > > > support parquet filter pushdown, but that might be added in
> > > > > > > the next release after 1.9.0.
> > > > > > >
> > > > > > > - Rahul
> > > > > > >
> > > > > > > On Sun, Nov 13, 2016 at 11:06 AM, Sonny Heer <[email protected]> wrote:
> > > > > > >
> > > > > > > > I'm running a drill query with a where clause on a
> > > > > > > > non-partitioned column via the hive storage plugin.
> > > > > > > > This query inspects all partitions (kind of expected), but
> > > > > > > > when I run the same query in Hive I can see a predicate
> > > > > > > > passed down to the query plan. This particular query is
> > > > > > > > much faster in Hive vs Drill. BTW, these are parquet files.
> > > > > > > >
> > > > > > > > Hive:
> > > > > > > >
> > > > > > > > Stage-0
> > > > > > > >   Fetch Operator
> > > > > > > >     limit:-1
> > > > > > > >     Select Operator [SEL_2]
> > > > > > > >       outputColumnNames:["_col0"]
> > > > > > > >       Filter Operator [FIL_4]
> > > > > > > >         predicate:(my_column = 123) (type: boolean)
> > > > > > > >         TableScan [TS_0]
> > > > > > > >           alias:my_table
> > > > > > > >
> > > > > > > > Any idea why this is? My guess is that Hive is storing
> > > > > > > > hive-specific info in the parquet file since it was created
> > > > > > > > through Hive, although it seems the drill-hive plugin
> > > > > > > > should honor this too. Not sure, but I am willing to look
> > > > > > > > through the code if someone can point me in the right
> > > > > > > > direction. Thanks!

--
Pushpinder S. Heer
Senior Software Engineer
m: 360-434-4354 h: 509-884-2574
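To make Rahul's point 3 (directory-based partitioning) concrete: since hive partitions are just sub-directories, Drill exposes each directory level below the queried path as an implicit column (dir0, dir1, ...). A minimal sketch against the DFS plugin, assuming a hypothetical warehouse path and a hive partition column named event_date:

```sql
-- Hypothetical layout created by Hive:
--   /user/hive/warehouse/my_table/event_date=2016-11-01/000000_0.parquet
-- dir0 refers to the first directory level below the queried path.
SELECT my_column
FROM dfs.`/user/hive/warehouse/my_table`
WHERE dir0 = 'event_date=2016-11-01';
```

So you don't tell Drill which column the directories are partitioned on; instead you filter on the directory value itself, and Drill prunes the sub-directories that don't match.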
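For Rahul's point 2 (a hive table backed by csv): the DFS plugin sees only text, and Drill surfaces text-file fields as the `columns` array, so the casts can live in a view. A sketch, with the path, view name, and column names all hypothetical:

```sql
-- Cast the text columns once in a view so queries get typed columns.
-- dfs.tmp is the default writable workspace in a stock Drill install.
CREATE VIEW dfs.tmp.my_table_view AS
SELECT CAST(columns[0] AS INT)     AS my_column,
       CAST(columns[1] AS VARCHAR) AS my_name
FROM dfs.`/user/hive/warehouse/my_csv_table`;
```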
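And for "option 3" (hive plugin + Drill's native parquet reader), the switch Rahul names is a session/system option. A sketch, with the table name hypothetical:

```sql
-- Keep the hive plugin for metastore metadata, but read the parquet
-- files with Drill's own reader.
ALTER SESSION SET `store.hive.optimize_scan_with_native_readers` = true;

SELECT my_column FROM hive.my_table WHERE my_column = 123;
```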
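Finally, to check whether a filter actually reached the parquet scan (relevant for option 2 on Drill 1.9+), the physical plan can be inspected the same way the Hive plan was above; if pushdown happened, the predicate should appear inside the Scan operator rather than only in a separate Filter. Path and column are hypothetical:

```sql
EXPLAIN PLAN FOR
SELECT my_column
FROM dfs.`/user/hive/warehouse/my_table`
WHERE my_column = 123;
```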
