Hi

I am new to Drill and am working on a capability study on storing telecoms 
probe data as parquet files on an HDFS server, for later analysis and 
visualisation in Tableau Desktop/Server, connecting through Drill and 
Zookeeper via ODBC/JDBC.

We store the parquet files on the HDFS server using an in-house ETL platform, 
which amongst other things transforms the massive volumes of telecoms probe 
data into millions of parquet files, writing out the parquet files directly to 
HDFS using AvroParquetWriter. The probe data arrives at regular intervals (5 to 
15 minutes; configurable), so for performance reasons we use this direct 
AvroParquetWriter approach rather than writing out intermediate files and 
loading them via the Drill CTAS route.

There has been some success, together with some frustration. After extensive 
experimentation we have come to the conclusion that to access these parquet 
files using Tableau we have to configure Drill with individual views for each 
parquet schema, and cast the columns to specific data types before Tableau can 
access the data correctly.

This is a surprise. We define Avro schemas for each parquet file, and the 
parquet files store the schema as part of the data, so I expected Drill to have 
some way of exposing those schemas to Tableau. Instead we find we have to 
maintain a schema definition in Avro for the AvroParquetWriter phase, and also 
a Drill view for each schema to make the data visible to Tableau.
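
To illustrate the duplication, here is a rough sketch of how the view 
definitions might be generated from the Avro schemas instead of being written 
by hand. It is only a sketch under assumptions: the type mapping, the 
VARCHAR(255) width, the field names, the view name and the table path are all 
made up for illustration, and it handles only flat records with primitive or 
nullable fields.

```python
import json

# Assumed mapping from Avro primitive types to Drill SQL types.
# The VARCHAR width is an arbitrary choice for this sketch.
AVRO_TO_DRILL = {
    "boolean": "BOOLEAN",
    "int": "INT",
    "long": "BIGINT",
    "float": "FLOAT",
    "double": "DOUBLE",
    "string": "VARCHAR(255)",
    "bytes": "VARBINARY",
}


def drill_type(avro_type):
    """Return the Drill type for an Avro primitive or a nullable union."""
    if isinstance(avro_type, list):
        # Nullable union such as ["null", "long"]: use the single non-null branch.
        non_null = [t for t in avro_type if t != "null"]
        if len(non_null) != 1:
            raise ValueError("unsupported union: %r" % (avro_type,))
        avro_type = non_null[0]
    if not isinstance(avro_type, str) or avro_type not in AVRO_TO_DRILL:
        raise ValueError("unsupported Avro type: %r" % (avro_type,))
    return AVRO_TO_DRILL[avro_type]


def view_ddl(avro_schema_json, view_name, table_path):
    """Build a CREATE VIEW statement that casts every field of a flat record."""
    schema = json.loads(avro_schema_json)
    columns = ",\n".join(
        "  CAST(`{0}` AS {1}) AS `{0}`".format(f["name"], drill_type(f["type"]))
        for f in schema["fields"]
    )
    return "CREATE OR REPLACE VIEW {} AS\nSELECT\n{}\nFROM {};".format(
        view_name, columns, table_path
    )


if __name__ == "__main__":
    # Hypothetical schema, view name and path, for illustration only.
    example_schema = json.dumps({
        "type": "record",
        "name": "ProbeRecord",
        "fields": [
            {"name": "imsi", "type": "string"},
            {"name": "bytes_up", "type": ["null", "long"]},
        ],
    })
    print(view_ddl(example_schema, "dfs.tmp.`probe_view`", "dfs.`/data/probe`"))
```

Something like this would at least keep the Avro schema as the single source of 
truth, but it still leaves one view per schema to create and maintain in Drill.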

Also, as part of our experimentation we did create some parquet files using 
CTAS. The directory is created and the files contain the data, but the tables 
do not appear when we run a SHOW TABLES command.

Are we correct in thinking that Tableau requires views to be created, or 
have we missed something obvious here?

Will the new REFRESH TABLE METADATA <path to table> feature (Drill 1.2?) help 
us when it becomes available?

Help and suggestions much appreciated.

Cheers -- Chris
