That did work faster: I can now get the 10 rows in 12 seconds as opposed to
25.

So in my 25-second query, I CAST all items from the Parquet, but do I need
to do that? For the 12-second query, I only CONVERT_FROM the string values,
and the view seems happier. So that's nice.
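For reference, a minimal sketch of the two view definitions being compared (the table path and the column names `name` and `id` are placeholders, not from the actual schema):

```sql
-- Slower variant (~25 s for 10 rows): CAST every field read from Parquet
CREATE OR REPLACE VIEW dfs.tmp.`my_view_cast` AS
SELECT CAST(name AS VARCHAR) AS name,
       CAST(id AS INT)      AS id
FROM dfs.`/data/my_parquet`;

-- Faster variant (~12 s): CONVERT_FROM only on the binary string columns,
-- leaving the already-typed columns alone
CREATE OR REPLACE VIEW dfs.tmp.`my_view_convert` AS
SELECT CONVERT_FROM(name, 'UTF8') AS name,
       id
FROM dfs.`/data/my_parquet`;
```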

Thanks for the pointer. I am playing around with CTAS-ing this into another
table; I will try the CTAS (see the other thread) with CONVERT_FROM rather
than CAST.
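A hypothetical sketch of what that CTAS could look like (table path and column names are placeholders):

```sql
-- Materialize the conversion once so later queries read native VARCHARs
-- instead of paying the CONVERT_FROM cost on every scan
CREATE TABLE dfs.tmp.`my_table_converted` AS
SELECT CONVERT_FROM(name, 'UTF8') AS name,
       id
FROM dfs.`/data/my_parquet`;
```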

Thanks!

On Mon, May 23, 2016 at 9:45 AM, Andries Engelbrecht <
[email protected]> wrote:

> John,
>
> See if convert_from helps in this regard; I believe it is supposed to be
> faster than CAST to VARCHAR.
>
> This is likely what will work on your data
> CONVERT_FROM(<column>, 'UTF8')
>
> Hopefully someone with more in depth knowledge of the Drill Parquet reader
> can comment.
>
> --Andries
>
>
>
> > On May 23, 2016, at 7:35 AM, John Omernik <[email protected]> wrote:
> >
> > I am learning more about my data here: the data was created with a CDH
> > version of the Apache parquet-mr library (not sure which version yet;
> > getting that soon). They used Snappy and version 1.0 of the Parquet spec
> > because Impala needs it. They are also using setEnableDictionary on the
> > write.
> >
> > Trying to figure things out right now
> >
> > If I make a view and cast all string fields to a VARCHAR drill shows the
> > right result, but it's slow.
> >
> > (10 row select from raw = 1.9 seconds, 10 row select with CAST in a view
> > = 25 seconds)
> >
> > I've resigned myself to converting the table once for performance, which
> > isn't an issue; however, I am getting different issues on that front
> > (I'll open a new thread for that).
> >
> > Other than cast(field AS VARCHAR) as field, is there any other (perhaps
> > more performant) way to handle this situation?
> >
> >
> >
> >
> >
> > On Mon, May 23, 2016 at 8:31 AM, Todd <[email protected]> wrote:
> >
> >>
> >> Looks like Impala encoded the strings as binary data; I think there is
> >> some configuration in Drill (I know Spark has one) that helps do the
> >> conversion.
> >>
> >>
> >>
> >>
> >>
> >> At 2016-05-23 21:25:17, "John Omernik" <[email protected]> wrote:
> >>> Hey all, I have some Parquet files that I believe were made in a Map
> >> Reduce
> >>> job and work well in Impala, however, when I read them in Drill, the
> >> fields
> >>> that are strings come through as [B@25ddbb etc. The exact string
> >>> represented as regex would be /\[B@[a-f0-9]{8}/  (Pointers maybe?)
> >>>
> >>> Well, I found I can cast those fields as VARCHAR... and get the right
> >>> data... is this the right approach? Why is this happening? Performance
> >>> wise, am I hurting anything by doing the cast to VARCHAR?
> >>>
> >>>
> >>> Any thoughts would be helpful...
> >>>
> >>> John
> >>
>
>
