There has been discussion around this in the past. But I am not sure if there is a JIRA open for it. Can you please go ahead and raise a JIRA for this?
- Rahul

On Mon, Sep 28, 2015 at 9:13 AM, Chris Mathews <[email protected]> wrote:

> Thank you Rahul for the confirmation - I thought I was losing the plot for a while there.
>
> Are there any plans for Drill to utilise the metadata from the footer of the parquet files, or even the new metadata cache files, or should a Jira request be raised for this, as it seems a major step towards simplification for reporting tools?
>
> Cheers — Chris
>
>
> On 28 Sep 2015, at 16:50, rahul challapalli <[email protected]> wrote:
> >
> > Your observation is right. We need to create a view on top of any file/folder for it to be available in Tableau or any reporting tool. This makes sense with text and even JSON formats, as Drill does not know the data types for the fields until it executes the queries. With parquet, however, Drill could leverage that information from the footers and make it available to reporting tools. But currently it does not do that.
> >
> > With the new "REFRESH TABLE METADATA" feature, we collect all the information from the parquet footers and store it in a cache file. Even in this case, Drill does not leverage this information to provide metadata to reporting tools.
> >
> > - Rahul
> >
> > On Mon, Sep 28, 2015 at 6:25 AM, Chris Mathews <[email protected]> wrote:
> >
> >> Hi
> >>
> >> Being new to Drill, I am working on a capabilities study to store telecoms probe data as parquet files on an HDFS server, for later analysis and visualisation using Tableau Desktop/Server with Drill and Zookeeper via ODBC/JDBC etc.
> >>
> >> We store the parquet files on the HDFS server using an in-house ETL platform, which amongst other things transforms the massive volumes of telecoms probe data into millions of parquet files, writing out the parquet files directly to HDFS using AvroParquetWriter.
> >> The probe data arrives at regular intervals (5 to 15 minutes; configurable), so for performance reasons we use this direct AvroParquetWriter approach rather than writing out intermediate files and loading them via the Drill CTAS route.
> >>
> >> There has been some success, together with some frustration. After extensive experimentation we have come to the conclusion that, to access these parquet files using Tableau, we have to configure Drill with individual views for each parquet schema, and cast the columns to specific data types, before Tableau can access the data correctly.
> >>
> >> This is a surprise, as I thought Drill would have some way of exporting the schemas to Tableau, having defined Avro schemas for each parquet file, with the parquet files storing the schema as part of the data. We now find we have to generate schema definitions in Avro for the AvroParquetWriter phase, and also a Drill view for each schema to make them visible to Tableau.
> >>
> >> Also, as part of our experimentation we did create some parquet files using CTAS. The directory is created and the files contain the data, but the tables do not seem to be displayed when we do a SHOW TABLES command.
> >>
> >> Are we correct in our thinking about Tableau requiring views to be created, or have we missed something obvious here?
> >>
> >> Will the new REFRESH TABLE METADATA <path to table> feature (Drill 1.2?) help us when it becomes available?
> >>
> >> Help and suggestions much appreciated.
> >>
> >> Cheers -- Chris
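For readers landing on this thread: the per-schema view workaround discussed above usually looks something like the sketch below. The workspace, directory path, and column names and types are invented for illustration; substitute your own.

```sql
-- Hypothetical example: expose a directory of parquet files to a reporting
-- tool through a Drill view with explicit casts. All names here are made up.
CREATE OR REPLACE VIEW dfs.tmp.probe_data_v AS
SELECT
    CAST(event_time AS TIMESTAMP)   AS event_time,
    CAST(cell_id    AS VARCHAR(64)) AS cell_id,
    CAST(bytes_up   AS BIGINT)      AS bytes_up,
    CAST(bytes_down AS BIGINT)      AS bytes_down
FROM dfs.`/data/probes/parquet`;
```

A view like this must be created in a writable workspace; unlike raw files or directories, views do show up in SHOW TABLES and in the column metadata Drill exposes over ODBC/JDBC, which is why Tableau can pick them up.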

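Likewise, the CTAS route and the metadata cache command mentioned in the thread, as a rough sketch; paths and names are again hypothetical, and REFRESH TABLE METADATA requires Drill 1.2 or later.

```sql
-- Hypothetical sketch of the CTAS route mentioned above.
ALTER SESSION SET `store.format` = 'parquet';

-- Writes a directory of parquet files under the dfs.tmp workspace. Note that
-- the resulting directory is a "table" on disk, but plain files and
-- directories (unlike views) are not listed by SHOW TABLES.
CREATE TABLE dfs.tmp.`probe_summary` AS
SELECT cell_id, COUNT(*) AS probe_count
FROM dfs.`/data/probes/parquet`
GROUP BY cell_id;

-- From Drill 1.2: gather the parquet footer metadata into a cache file to
-- speed up query planning. As noted in the thread, this cache is not used
-- to provide schema metadata to reporting tools.
REFRESH TABLE METADATA dfs.tmp.`probe_summary`;
```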