Thank you Rahul for the confirmation - I thought I was losing the plot for a while there.
Are there any plans for Drill to utilise the metadata from the footer of the parquet files, or even the new metadata cache files, or should a Jira request be raised for this, as it seems a major step towards simplification for reporting tools?

Cheers -- Chris

> On 28 Sep 2015, at 16:50, rahul challapalli <[email protected]> wrote:
>
> Your observation is right. We need to create a view on top of any
> file/folder for it to be available in Tableau or any reporting tool. This
> makes sense with text and even json formats as drill does not know the data
> types for the fields until it executes the queries. With parquet however
> drill could leverage that information from the footers and make it
> available to reporting tools. But currently it does not do that.
>
> With the new "REFRESH TABLE METADATA" feature, we collect all the
> information from the parquet footers and store it in a cache file. Even in
> this case, drill does not leverage this information to provide metadata to
> reporting tools
>
> - Rahul
>
> On Mon, Sep 28, 2015 at 6:25 AM, Chris Mathews <[email protected]> wrote:
>
>> Hi
>>
>> Being new to Drill I am working on a capabilities study to store telecoms
>> probe data as parquet files on an HDFS server, for later analysis and
>> visualisation using Tableau Desktop/Server with Drill and Zookeeper via
>> ODBC/JDBC etc.
>>
>> We store the parquet files on the HDFS server using an in-house ETL
>> platform, which amongst other things transforms the massive volumes of
>> telecoms probe data into millions of parquet files, writing out the parquet
>> files directly to HDFS using AvroParquetWriter. The probe data arrives at
>> regular intervals (5 to 15 minutes; configurable), so for performance
>> reasons we use this direct AvroParquetWriter approach rather than writing
>> out intermediate files and loading them via the Drill CTAS route.
>>
>> There has been some success, together with some frustration. After
>> extensive experimentation we have come to the conclusion that to access
>> these parquet files using Tableau we have to configure Drill with
>> individual views for each parquet schema, and cast the columns to specific
>> data types before Tableau can access the data correctly.
>>
>> This is a surprise as I thought Drill would have some way of exporting the
>> schemas to Tableau having defined AVRO schemas for each parquet file, and
>> the parquet files storing the schema as part of the data. We now find we
>> have to generate schema definitions in AVRO for the AvroParquetWriter
>> phase, and also a Drill view for each schema to make them visible to
>> Tableau.
>>
>> Also, as part of our experimentation we did create some parquet files
>> using CTAS. The directory is created and the files contain the data but the
>> tables do not seem to be displayed when we do a SHOW TABLES command.
>>
>> Are we correct in our thinking about Tableau requiring views to be
>> created, or have we missed something obvious here ?
>>
>> Will the new REFRESH TABLE METADATA <path to table> feature (Drill 1.2 ?)
>> help us when it becomes available ?
>>
>> Help and suggestions much appreciated.
>>
>> Cheers -- Chris
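
The per-schema views discussed in this thread do not have to be written by hand; since a column list already exists for each Avro schema, the matching Drill view could be generated from it. What follows is only a minimal sketch under stated assumptions: the view name, the `dfs.tmp` workspace (assumed to be writable), the parquet path, and the column names and types are all hypothetical and would need to be replaced with real values.

```python
def build_view_sql(view_name, table_path, columns):
    """Build a Drill CREATE VIEW statement that casts each column to a fixed type.

    view_name, table_path and the (name, type) pairs in columns are
    hypothetical placeholders, not values taken from the thread.
    """
    select_list = ",\n".join(
        f"  CAST(`{name}` AS {sql_type}) AS `{name}`"
        for name, sql_type in columns
    )
    return (
        f"CREATE OR REPLACE VIEW {view_name} AS\n"
        f"SELECT\n{select_list}\n"
        f"FROM {table_path}"
    )


# Hypothetical probe schema, for illustration only.
probe_columns = [
    ("probe_id", "VARCHAR(32)"),
    ("event_time", "TIMESTAMP"),
    ("bytes_up", "BIGINT"),
]

sql = build_view_sql("dfs.tmp.`probe_view`", "dfs.`/data/probes`", probe_columns)
print(sql)
```

The generated statement would then be submitted to Drill (for example through sqlline or the ODBC/JDBC connection), after which the view should be visible to Tableau like any other table.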
