I am learning more about my data here. It was created with a CDH build of the
Apache parquet-mr library (not sure of the exact version yet; I'm getting that
soon). They used Snappy compression and version 1.0 of the Parquet spec
because Impala requires it. They are also using setEnableDictionary on the
write.

I'm still trying to figure things out right now.

If I make a view and cast all the string fields to VARCHAR, Drill shows the
right results, but it's slow.

(A 10-row select from the raw files takes 1.9 seconds; the same 10-row select
through the view with the CASTs takes 25 seconds.)
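
For reference, the view is just the CASTs wrapped around each string column.
A minimal sketch of what I'm doing (the dfs.tmp workspace, the /data/events
path, and the field names are placeholders for my actual schema):

  -- Placeholder names throughout: dfs.tmp is a writable workspace,
  -- /data/events is the directory of Parquet files, and field1/field2
  -- stand in for the string columns Drill reads back as binary.
  CREATE OR REPLACE VIEW dfs.tmp.events_vw AS
  SELECT
    CAST(field1 AS VARCHAR) AS field1,
    CAST(field2 AS VARCHAR) AS field2,
    other_field              -- non-string columns pass through unchanged
  FROM dfs.`/data/events`;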

I've resigned myself to converting the table once for performance, which
isn't a problem in itself; however, I'm running into different issues on that
front (I'll open a new thread for those).
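
(The one-time conversion is basically the same casts in a CTAS; again a
sketch with placeholder names, assuming Drill's default store.format of
parquet:

  -- Writes a new Parquet copy whose string columns are proper VARCHAR,
  -- so queries against it no longer need the casts.
  CREATE TABLE dfs.tmp.events_converted AS
  SELECT
    CAST(field1 AS VARCHAR) AS field1,
    CAST(field2 AS VARCHAR) AS field2,
    other_field
  FROM dfs.`/data/events`;
)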

Other than the cast(field AS VARCHAR) AS field approach, is there any other
(perhaps more performant) way to handle this situation?

On Mon, May 23, 2016 at 8:31 AM, Todd <[email protected]> wrote:

>
> Looks like Impala encoded the strings as binary data. I think there is some
> configuration in Drill (I know Spark has one) that helps with the conversion.
>
> At 2016-05-23 21:25:17, "John Omernik" <[email protected]> wrote:
> >Hey all, I have some Parquet files that I believe were made in a MapReduce
> >job and work well in Impala. However, when I read them in Drill, the fields
> >that are strings come through as [B@25ddbb etc. The exact string,
> >represented as a regex, would be /\[B@[a-f0-9]{8}/. (Pointers, maybe?)
> >
> >Well, I found I can cast those fields to VARCHAR... and get the right
> >data... Is this the right approach? Why is this happening? Performance-wise,
> >am I hurting anything by doing the cast to VARCHAR?
> >
> >
> >Any thoughts would be helpful...
> >
> >John
>