There has been discussion around this in the past. But I am not sure if there is a JIRA open for it. Can you please go ahead and raise a JIRA for this?
- Rahul

On Mon, Sep 28, 2015 at 9:13 AM, Chris Mathews <[email protected]> wrote:

> Thank you Rahul for the confirmation - I thought I was losing the plot for a while there.
>
> Are there any plans for Drill to utilise the metadata from the footer of the parquet files, or even the new metadata cache files, or should a Jira request be raised for this, as it seems a major step towards simplification for reporting tools?
>
> Cheers — Chris
>
>
> On 28 Sep 2015, at 16:50, rahul challapalli <[email protected]> wrote:
> >
> > Your observation is right. We need to create a view on top of any file/folder for it to be available in Tableau or any reporting tool. This makes sense with text and even JSON formats, as Drill does not know the data types for the fields until it executes the queries. With parquet, however, Drill could leverage that information from the footers and make it available to reporting tools. But currently it does not do that.
> >
> > With the new "REFRESH TABLE METADATA" feature, we collect all the information from the parquet footers and store it in a cache file. Even in this case, Drill does not leverage this information to provide metadata to reporting tools.
> >
> > - Rahul
> >
> > On Mon, Sep 28, 2015 at 6:25 AM, Chris Mathews <[email protected]> wrote:
> >
> >> Hi
> >>
> >> Being new to Drill, I am working on a capabilities study to store telecoms probe data as parquet files on an HDFS server, for later analysis and visualisation using Tableau Desktop/Server with Drill and Zookeeper via ODBC/JDBC etc.
> >>
> >> We store the parquet files on the HDFS server using an in-house ETL platform, which amongst other things transforms the massive volumes of telecoms probe data into millions of parquet files, writing out the parquet files directly to HDFS using AvroParquetWriter.
> >> The probe data arrives at regular intervals (5 to 15 minutes; configurable), so for performance reasons we use this direct AvroParquetWriter approach rather than writing out intermediate files and loading them via the Drill CTAS route.
> >>
> >> There has been some success, together with some frustration. After extensive experimentation we have come to the conclusion that, to access these parquet files using Tableau, we have to configure Drill with individual views for each parquet schema, and cast the columns to specific data types, before Tableau can access the data correctly.
> >>
> >> This is a surprise, as I thought Drill would have some way of exporting the schemas to Tableau, having defined Avro schemas for each parquet file, with the parquet files storing the schema as part of the data. We now find we have to generate schema definitions in Avro for the AvroParquetWriter phase, and also a Drill view for each schema to make them visible to Tableau.
> >>
> >> Also, as part of our experimentation we did create some parquet files using CTAS. The directory is created and the files contain the data, but the tables do not seem to be displayed when we do a SHOW TABLES command.
> >>
> >> Are we correct in our thinking about Tableau requiring views to be created, or have we missed something obvious here?
> >>
> >> Will the new REFRESH TABLE METADATA <path to table> feature (Drill 1.2?) help us when it becomes available?
> >>
> >> Help and suggestions much appreciated.
> >>
> >> Cheers -- Chris
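For readers landing on this thread: the per-schema view workaround discussed above usually looks something like the sketch below. The workspace, directory path, and column names and types are invented for illustration; substitute your own.

```sql
-- Hypothetical example: expose a directory of parquet files to a reporting
-- tool through a Drill view with explicit casts. All names here are made up.
CREATE OR REPLACE VIEW dfs.tmp.probe_data_v AS
SELECT
    CAST(event_time AS TIMESTAMP)   AS event_time,
    CAST(cell_id    AS VARCHAR(64)) AS cell_id,
    CAST(bytes_up   AS BIGINT)      AS bytes_up,
    CAST(bytes_down AS BIGINT)      AS bytes_down
FROM dfs.`/data/probes/parquet`;
```

A view like this must be created in a writable workspace; unlike raw files or directories, views do show up in SHOW TABLES and in the column metadata Drill exposes over ODBC/JDBC, which is why Tableau can pick them up.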

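Likewise, the CTAS route and the metadata cache command mentioned in the thread, as a rough sketch; paths and names are again hypothetical, and REFRESH TABLE METADATA requires Drill 1.2 or later.

```sql
-- Hypothetical sketch of the CTAS route mentioned above.
ALTER SESSION SET `store.format` = 'parquet';

-- Writes a directory of parquet files under the dfs.tmp workspace. Note that
-- the resulting directory is a "table" on disk, but plain files and
-- directories (unlike views) are not listed by SHOW TABLES.
CREATE TABLE dfs.tmp.`probe_summary` AS
SELECT cell_id, COUNT(*) AS probe_count
FROM dfs.`/data/probes/parquet`
GROUP BY cell_id;

-- From Drill 1.2: gather the parquet footer metadata into a cache file to
-- speed up query planning. As noted in the thread, this cache is not used
-- to provide schema metadata to reporting tools.
REFRESH TABLE METADATA dfs.tmp.`probe_summary`;
```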