Re: Making parquet data available to Tableau

Chris Mathews Mon, 28 Sep 2015 09:15:08 -0700

Thank you Rahul for confirmation - I thought I was losing the plot for a while 
there.


Are there any plans for Drill to utilise the metadata from the footer of the 
parquet files, or even the new metadata cache files, or should a Jira request 
be raised for this? It seems a major step towards simplification for 
reporting tools.

Cheers -- Chris

> On 28 Sep 2015, at 16:50, rahul challapalli <[email protected]> 
> wrote:
> 
> Your observation is right. We need to create a view on top of any
> file/folder for it to be available in Tableau or any reporting tool. This
> makes sense with text and even json formats as drill does not know the data
> types for the fields until it executes the queries. With parquet however
> drill could leverage that information from the footers and make it
> available to reporting tools. But currently it does not do that.
> 
> With the new "REFRESH TABLE METADATA" feature, we collect all the
> information from the parquet footers and store it in a cache file. Even in
> this case, drill does not leverage this information to provide metadata to
> reporting tools.
> 
> - Rahul
> 
> On Mon, Sep 28, 2015 at 6:25 AM, Chris Mathews <[email protected]> wrote:
> 
>> Hi
>> 
>> Being new to Drill I am working on a capabilities study to store telecoms
>> probe data as parquet files on an HDFS server, for later analysis and
>> visualisation using Tableau Desktop/Server with Drill and Zookeeper via
>> ODBC/JDBC etc.
>> 
>> We store the parquet files on the HDFS server using an in-house ETL
>> platform, which amongst other things transforms the massive volumes of
>> telecoms probe data into millions of parquet files, writing out the parquet
>> files directly to HDFS using AvroParquetWriter. The probe data arrives at
>> regular intervals (5 to 15 minutes; configurable), so for performance
>> reasons we use this direct AvroParquetWriter approach rather than writing
>> out intermediate files and loading them via the Drill CTAS route.
>> 
>> There has been some success, together with some frustration. After
>> extensive experimentation we have come to the conclusion that to access
>> these parquet files using Tableau we have to configure Drill with
>> individual views for each parquet schema, and cast the columns to specific
>> data types before Tableau can access the data correctly.
>> 
>> This is a surprise as I thought Drill would have some way of exporting the
>> schemas to Tableau having defined AVRO schemas for each parquet file, and
>> the parquet files storing the schema as part of the data.  We now find we
>> have to generate schema definitions in AVRO for the AvroParquetWriter
>> phase, and also a Drill view for each schema to make them visible to
>> Tableau.
>> 
>> Also, as part of our experimentation we did create some parquet files
>> using CTAS. The directory is created and the files contain the data but the
>> tables do not seem to be displayed when we do a SHOW TABLES command.
>> 
>> Are we correct in our thinking about Tableau requiring views to be
>> created, or have we missed something obvious here ?
>> 
>> Will the new REFRESH TABLE METADATA <path to table> feature (Drill 1.2 ?)
>> help us when it becomes available ?
>> 
>> Help and suggestions much appreciated.
>> 
>> Cheers -- Chris
>> 
>>
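
The thread describes maintaining both an AVRO schema (for AvroParquetWriter) and a hand-written Drill view with casts for every parquet schema. One way to reduce that duplication might be to generate the view text from the AVRO schema itself. The following is only a minimal sketch of that idea: the type mapping, the function names, and the example schema, view name, and path are all assumptions made for illustration, not something taken from Drill or from the setup described above. It handles only primitive types and nullable unions of a primitive, and depending on how strings were written, a plain CAST may not be sufficient.

```python
import json

# Assumed mapping from Avro primitive types to Drill SQL types (illustrative only).
AVRO_TO_DRILL = {
    "boolean": "BOOLEAN",
    "int": "INT",
    "long": "BIGINT",
    "float": "FLOAT",
    "double": "DOUBLE",
    "string": "VARCHAR",
    "bytes": "VARBINARY",
}


def drill_type(avro_type):
    """Return the Drill type for an Avro primitive or a nullable union of one."""
    if isinstance(avro_type, list):
        # A union such as ["null", "string"] is treated as a nullable string.
        non_null = [t for t in avro_type if t != "null"]
        if len(non_null) != 1:
            raise ValueError(f"unsupported union: {avro_type!r}")
        avro_type = non_null[0]
    if not isinstance(avro_type, str) or avro_type not in AVRO_TO_DRILL:
        raise ValueError(f"unsupported Avro type: {avro_type!r}")
    return AVRO_TO_DRILL[avro_type]


def build_view_sql(avro_schema_json, view_name, table_path):
    """Build a CREATE OR REPLACE VIEW statement that casts every column."""
    schema = json.loads(avro_schema_json)
    columns = ",\n".join(
        f"  CAST(`{f['name']}` AS {drill_type(f['type'])}) AS `{f['name']}`"
        for f in schema["fields"]
    )
    return (
        f"CREATE OR REPLACE VIEW {view_name} AS\n"
        f"SELECT\n{columns}\n"
        f"FROM {table_path}"
    )


# Hypothetical schema, view name, and path, used only to show the output shape.
EXAMPLE_SCHEMA = """
{"type": "record", "name": "ProbeRecord", "fields": [
  {"name": "probe_id", "type": "long"},
  {"name": "cell", "type": ["null", "string"]}
]}
"""

if __name__ == "__main__":
    print(build_view_sql(EXAMPLE_SCHEMA, "dfs.tmp.`probe_view`", "dfs.`/data/probe`"))
```

The generated statement would still need to be run against Drill (for example through sqlline or the ODBC/JDBC connection) in a writable workspace before the view becomes visible to Tableau.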
