The custom Drill reader might have a bug similar to PARQUET-244.
On Sun, May 29, 2016 at 1:12 PM, John Omernik <[email protected]> wrote:

> *sigh* PARQUET-244 is likely not my issue, considering that
> DeltaByteArrayWriter isn't in my stack trace. (I love learning CS101-type
> stuff in front of a whole community; it's great for self-esteem! :)
>
> On Sun, May 29, 2016 at 7:08 AM, John Omernik <[email protected]> wrote:
>
>> Doing more research, I found this:
>>
>> https://issues.apache.org/jira/browse/PARQUET-244
>>
>> So the version that is being written is 1.5-cdh, thus the writer does
>> have this bug. The question is: A. Could we reproduce this in Drill 1.6
>> to see if the default reader has the same error on known bad data? And
>> B. Should Drill be able to handle reading data created with this bug?
>> (Note: the Parquet project seems to have implemented handling for
>> reading data created with this bug:
>> https://github.com/apache/parquet-mr/pull/235. Note: I am not sure this
>> is the same thing I am seeing; I am just trying to find the things in
>> the Parquet project that seem close to what I am seeing.)
>>
>> John
>>
>> On Sat, May 28, 2016 at 9:28 AM, John Omernik <[email protected]> wrote:
>>
>>> New update:
>>>
>>> Thank you, Abdel, for giving me an idea to try.
>>>
>>> When I was first doing the CTAS, I tried setting
>>> store.parquet.use_new_reader = true. When I did that, Drill effectively
>>> "hung"; I am not sure why, perhaps memory issues? (These are fairly
>>> beefy bits: 24 GB of heap, 84 GB of direct memory.)
>>>
>>> But now that I've gotten further in troubleshooting, I have "one" bad
>>> file, so I tried the min(row_created_ts) on that one bad file. With
>>> store.parquet.use_new_reader set to false (the default) I get the Array
>>> Index Out of Bounds, but when I set it to true, the query on that one
>>> file now works. So the "new" reader can handle the file; that's
>>> interesting. It still leaves me in a bit of a bind, because setting the
>>> new reader to true on the CTAS doesn't actually work (like I said,
>>> memory issues, etc.). Any ideas on the new reader, and how I could get
>>> memory consumption down and actually have that succeed? I am not doing
>>> any casting from the original Parquet files (would that help?). All I
>>> am doing in my view is the CONVERT_FROM(field, 'UTF8') on the string
>>> fields (because for some reason the Parquet string fields are read as
>>> binary in Drill). I was assuming (hoping?) that on a CTAS from Parquet
>>> to Parquet the types would be preserved; is that an incorrect
>>> assumption?
>>>
>>> Given this new piece of information, are there other steps I may want
>>> to try?
>>>
>>> Thanks, Abdel, for the idea!
>>>
>>> On Sat, May 28, 2016 at 8:50 AM, John Omernik <[email protected]> wrote:
>>>
>>>> Thanks, Ted. I summarized the problem to the Parquet dev list. At this
>>>> point (and I hate that I have the restrictions on sharing the whole
>>>> file) I am just looking for new ways to troubleshoot the problem. I
>>>> know the MapR support team is scratching their heads on next steps as
>>>> well. I did offer them (and I offer it to others who may want to look
>>>> into the problem) a screen share with me, even allowing control and
>>>> in-depth troubleshooting. The cluster is not yet in production, so I
>>>> can restart things, change debug settings, etc., and work with anyone
>>>> who may be interested.
>>>> (I know it's not much to offer: a time-consuming phone call to help
>>>> someone else with a problem.) But I do offer it. Any other ideas would
>>>> also be welcome.
>>>>
>>>> John
>>>>
>>>> On Sat, May 28, 2016 at 6:25 AM, Ted Dunning <[email protected]> wrote:
>>>>
>>>>> The Parquet user/dev mailing list might be helpful here. They have a
>>>>> real stake in making sure that all readers/writers can work together.
>>>>> The problem here really does sound like there is a borderline case
>>>>> that isn't handled as well in the Drill special-purpose Parquet
>>>>> reader as in the normal readers.
>>>>>
>>>>> On Fri, May 27, 2016 at 7:23 PM, John Omernik <[email protected]> wrote:
>>>>>
>>>>>> So, working with MapR support, we tried that with Impala, but it
>>>>>> didn't produce the desired results, because the output file worked
>>>>>> fine in Drill. Theory: the evil file is created in MapReduce, using
>>>>>> a different writer than Impala uses. Impala can read the evil file,
>>>>>> but when it writes it uses its own writer, "fixing" the issue on the
>>>>>> fly. Thus, Drill can't read the evil file, but if we try to reduce
>>>>>> it with Impala, the file is no longer evil; consider it... chaotic
>>>>>> neutral... (for all you D&D fans).
>>>>>>
>>>>>> I'd ideally love to extract just the badness, but I'm on the phone
>>>>>> now with MapR support to figure out HOW; that is the question at
>>>>>> hand.
>>>>>>
>>>>>> John
>>>>>>
>>>>>> On Fri, May 27, 2016 at 10:09 AM, Ted Dunning <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> On Thu, May 26, 2016 at 8:50 PM, John Omernik <[email protected]> wrote:
>>>>>>>
>>>>>>>> So, if we have a known "bad" Parquet file (I use quotes because,
>>>>>>>> remember, Impala queries this file just fine) created in
>>>>>>>> MapReduce, with a BIGINT-typed column causing Array Index Out of
>>>>>>>> Bounds problems, what would your next steps be to troubleshoot?
>>>>>>>
>>>>>>> I would start reducing the size of the evil file.
>>>>>>>
>>>>>>> If you have a tool that can query the bad Parquet file and write a
>>>>>>> new one (sounds like Impala might do here), then selecting just the
>>>>>>> evil column is a good first step.
>>>>>>>
>>>>>>> After that, I would start bisecting to find a small range that
>>>>>>> still causes the problem. There may not be such a range, but it is
>>>>>>> a good thing to try.
>>>>>>>
>>>>>>> At that point, you could easily have the problem down to a few
>>>>>>> kilobytes of data that can be used in a unit test.
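
For reference, the single-file comparison John describes (default reader failing, alternative reader succeeding) boils down to a sketch like the one below. The file path is a placeholder; only store.parquet.use_new_reader, min(), and row_created_ts come from the thread.

  -- Rough sketch of the single-file test; /path/to/bad_file.parquet is a
  -- hypothetical location under the dfs storage plugin.

  -- Default (fast) Parquet reader: the case that throws
  -- ArrayIndexOutOfBoundsException in the thread.
  ALTER SESSION SET `store.parquet.use_new_reader` = false;
  SELECT MIN(row_created_ts) FROM dfs.`/path/to/bad_file.parquet`;

  -- Same query with the alternative reader, which handles the file.
  ALTER SESSION SET `store.parquet.use_new_reader` = true;
  SELECT MIN(row_created_ts) FROM dfs.`/path/to/bad_file.parquet`;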
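
The view-plus-CTAS setup John mentions would look roughly like the following. Every table, view, and column name here is hypothetical; CONVERT_FROM(field, 'UTF8') is the conversion he describes applying to the string fields that Drill reads back from Parquet as binary.

  -- Hypothetical names throughout; only the CONVERT_FROM(..., 'UTF8')
  -- pattern and the Parquet-to-Parquet CTAS come from the thread.
  CREATE OR REPLACE VIEW dfs.tmp.source_view AS
  SELECT
    CONVERT_FROM(some_string_field, 'UTF8') AS some_string_field,  -- Parquet strings arrive as binary
    row_created_ts                                                  -- the BIGINT column discussed above
  FROM dfs.`/path/to/source_parquet/`;

  -- CTAS from the view; with store.format left at its parquet default,
  -- this writes new Parquet files.
  CREATE TABLE dfs.tmp.rewritten_copy AS
  SELECT * FROM dfs.tmp.source_view;

Whether the BIGINT column survives a Parquet-to-Parquet CTAS unchanged is exactly the open question in the thread; the sketch only shows the shape of the statements.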
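
Ted's reduce-and-bisect suggestion could in principle be attempted from Drill itself against the original file, using the reader that fails. This is only a sketch with a hypothetical path, and it rests on the assumption, not guaranteed, that a LIMIT/OFFSET window keeps Drill from decoding the whole column, so that shrinking or shifting the window narrows down which rows hit the bad pages.

  -- Bisection sketch: keep the default reader (the one that fails) and
  -- probe ever-smaller row windows of the suspect column.
  ALTER SESSION SET `store.parquet.use_new_reader` = false;

  -- First half of the file; if this errors, halve again (LIMIT 250000),
  -- otherwise move the window (OFFSET 500000), and so on.
  SELECT row_created_ts
  FROM dfs.`/path/to/bad_file.parquet`
  LIMIT 500000 OFFSET 0;

As the thread notes, rewriting the reduced range through another writer may "fix" the file in the process, so the narrowed range ultimately still has to be taken from the original file for a unit test.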
