That did work faster: I can now get the 10 rows in 12 seconds as opposed to 25.
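As a rough sketch, the two approaches being compared might look like this in Drill (the file path and column names are hypothetical, for illustration only):

```sql
-- Hypothetical path/columns. Slower approach: CAST every string column
-- to VARCHAR (~25 s for 10 rows in this thread).
CREATE OR REPLACE VIEW dfs.tmp.orders_cast AS
SELECT CAST(customer AS VARCHAR(255)) AS customer,
       CAST(city     AS VARCHAR(255)) AS city,
       amount
FROM dfs.data.`orders.parquet`;

-- Faster approach: CONVERT_FROM only on the binary string columns
-- (~12 s for the same 10 rows).
CREATE OR REPLACE VIEW dfs.tmp.orders_utf8 AS
SELECT CONVERT_FROM(customer, 'UTF8') AS customer,
       CONVERT_FROM(city, 'UTF8')     AS city,
       amount
FROM dfs.data.`orders.parquet`;
```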
So in my 25-second query I CAST all items from the Parquet file, but do I need to do that? For the 12-second query I only use CONVERT_FROM on the string values, and the view seems happier. So that's nice. Thanks for the pointer. I am playing around with CTAS-ing this into another table; I will try the CTAS (see the other thread) with CONVERT_FROM rather than CAST. Thanks!

On Mon, May 23, 2016 at 9:45 AM, Andries Engelbrecht <[email protected]> wrote:

> John,
>
> See if convert_from helps in this regard, I believe it is supposed to be
> faster than cast varchar.
>
> This is likely what will work on your data:
> CONVERT_FROM(<column>, 'UTF8')
>
> Hopefully someone with more in-depth knowledge of the Drill Parquet reader
> can comment.
>
> --Andries
>
>
> > On May 23, 2016, at 7:35 AM, John Omernik <[email protected]> wrote:
> >
> > I am learning more about my data here; the data was created in a CDH
> > version of the Apache parquet-mr library. (Not sure which version yet,
> > getting that soon.) They used Snappy and version 1.0 of the Parquet spec
> > because Impala needs it. They are also using setEnableDictionary on the
> > write.
> >
> > Trying to figure things out right now.
> >
> > If I make a view and CAST all string fields to VARCHAR, Drill shows the
> > right result, but it's slow.
> >
> > (10-row select from raw = 1.9 seconds; 10-row select with CAST in a
> > view = 25 seconds)
> >
> > I've resigned myself to converting the table once for performance, which
> > isn't an issue; however, I am getting different issues on that front
> > (I'll open a new thread for that).
> >
> > Other than cast(field AS VARCHAR) as field, is there any other (perhaps
> > more performant) way to handle this situation?
> >
> > On Mon, May 23, 2016 at 8:31 AM, Todd <[email protected]> wrote:
> >
> >> Looks like Impala encoded the strings as binary data. I think there is
> >> some configuration in Drill (I know Spark has one) that helps do the
> >> conversion.
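For reference, the Spark setting Todd is likely alluding to is `spark.sql.parquet.binaryAsString`, which tells Spark to read Parquet BINARY columns as strings; I am not aware of a direct Drill equivalent, hence the CONVERT_FROM workaround:

```sql
-- Spark SQL session setting (not Drill): treat Parquet BINARY as STRING.
SET spark.sql.parquet.binaryAsString=true;
```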
> >>
> >> At 2016-05-23 21:25:17, "John Omernik" <[email protected]> wrote:
> >>
> >>> Hey all, I have some Parquet files that I believe were made in a Map
> >>> Reduce job and work well in Impala; however, when I read them in Drill,
> >>> the fields that are strings come through as [B@25ddbb etc. The exact
> >>> string, represented as a regex, would be /\[B@[a-f0-9]{8}/
> >>> (pointers, maybe?)
> >>>
> >>> Well, I found I can cast those fields as VARCHAR... and get the right
> >>> data... Is this the right approach? Why is this happening?
> >>> Performance-wise, am I hurting anything by doing the cast to VARCHAR?
> >>>
> >>> Any thoughts would be helpful...
> >>>
> >>> John
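A sketch of the CTAS-with-CONVERT_FROM approach discussed in the thread, assuming hypothetical paths and column names; materializing the table once means later queries read real VARCHAR data with no per-query conversion cost:

```sql
-- Hypothetical paths/columns. Write the converted copy out as Parquet once.
ALTER SESSION SET `store.format` = 'parquet';

CREATE TABLE dfs.tmp.orders_converted AS
SELECT CONVERT_FROM(customer, 'UTF8') AS customer,
       CONVERT_FROM(city, 'UTF8')     AS city,
       amount
FROM dfs.data.`orders.parquet`;
```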
