John,

See if CONVERT_FROM helps in this regard; I believe it is supposed to be faster 
than CAST to VARCHAR.

This is likely what will work on your data:
CONVERT_FROM(<column>, 'UTF8')
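
A minimal sketch of how that might look in a view, in case it helps. The 
view name, file path, and column names below are hypothetical placeholders, 
and I am assuming a writable dfs.tmp workspace:

```sql
-- Hypothetical view, path, and column names; adjust to your data.
-- CONVERT_FROM decodes each binary column as UTF-8 text.
CREATE OR REPLACE VIEW dfs.tmp.`strings_decoded` AS
SELECT
  CONVERT_FROM(t.`name`, 'UTF8') AS `name`,
  CONVERT_FROM(t.`city`, 'UTF8') AS `city`
FROM dfs.`/path/to/parquet_dir` t;
```

Running the same 10 row select against this view and against the CAST view 
should show whether it is actually faster on your files.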

Hopefully someone with more in-depth knowledge of the Drill Parquet reader can 
comment.

--Andries



> On May 23, 2016, at 7:35 AM, John Omernik <[email protected]> wrote:
> 
> I am learning more about my data here. The data was created with a CDH
> version of the Apache parquet-mr library (not sure which version yet, getting
> that soon). They used Snappy and version 1.0 of the Parquet spec due to
> Impala needing it. They are also using setEnableDictionary on the write.
> 
> Trying to figure things out right now
> 
> If I make a view and cast all string fields to VARCHAR, Drill shows the
> right result, but it's slow.
> 
> (10 row select from raw = 1.9 seconds, 10 row select with CAST in a view =
> 25 seconds)
> 
> I've resigned myself to converting the table once for performance, which
> isn't an issue; however, I am getting different issues on that front (I'll
> open a new thread for that).
> 
> Other than CAST(field AS VARCHAR) AS field, is there any other (perhaps
> more performant) way to handle this situation?
> 
> 
> 
> 
> 
> On Mon, May 23, 2016 at 8:31 AM, Todd <[email protected]> wrote:
> 
>> 
>> Looks like Impala encoded the strings as binary data. I think there is some
>> configuration in Drill (I know Spark has one) that helps do the conversion.
>> 
>> 
>> 
>> 
>> 
>> At 2016-05-23 21:25:17, "John Omernik" <[email protected]> wrote:
>>> Hey all, I have some Parquet files that I believe were made in a Map Reduce
>>> job and work well in Impala, however, when I read them in Drill, the fields
>>> that are strings come through as [B@25ddbb etc. The exact string
>>> represented as regex would be /\[B@[a-f0-9]{8}/  (Pointers maybe?)
>>> 
>>> Well, I found I can cast those fields as VARCHAR... and get the right
>>> data... is this the right approach?  Why is this happening? Performance-wise,
>>> am I hurting something by doing the cast to VARCHAR?
>>> 
>>> 
>>> Any thoughts would be helpful...
>>> 
>>> John
>> 
