*sigh* PARQUET-244 is likely not my issue, considering that DeltaByteArrayWriter isn't in my stack trace. (I love learning CS101-type stuff in front of a whole community, it's great for self-esteem! :)
On Sun, May 29, 2016 at 7:08 AM, John Omernik <[email protected]> wrote:

> Doing more research, I found this:
>
> https://issues.apache.org/jira/browse/PARQUET-244
>
> So the version that is being written is 1.5-cdh, thus the writer does have this bug. The question is: A) could we reproduce this in Drill 1.6 to see if the default reader has the same error on known bad data, and B) should Drill be able to handle reading data created with this bug? (Note: the Parquet project seems to have implemented handling for reading data created with this bug: https://github.com/apache/parquet-mr/pull/235. I am not sure this is the same thing I am seeing; I am just trying to find the things in the Parquet project that seem close to what I am seeing.)
>
> John
>
> On Sat, May 28, 2016 at 9:28 AM, John Omernik <[email protected]> wrote:
>
>> New Update
>>
>> Thank you Abdel for giving me an idea to try.
>>
>> When I was first doing the CTAS, I tried setting store.parquet.use_new_reader = true. What occurred when I did that was that Drill effectively "hung." I am not sure why; perhaps memory issues? (These are fairly beefy bits: 24 GB of heap, 84 GB of direct memory.)
>>
>> But now that I've gotten further in troubleshooting, I have "one" bad file, so I tried the min(row_created_ts) on the one bad file. With store.parquet.use_new_reader set to false (the default) I get the Array Index Out of Bounds, but when I set it to true, the one file now works. So the "new" reader can handle the file; that's interesting. It still leaves me in a bit of a bind, because setting the new reader to true on the CTAS doesn't actually work (like I said, memory issues, etc.). Any ideas on the new reader, and how I could get memory consumption down and actually have that succeed? I am not doing any casting from the original Parquet files (would that help?). All I am doing in my view is the CONVERT_FROM(field, 'UTF8') on the string fields (because for some reason the Parquet string fields are read as binary in Drill). I was assuming (hoping?) that on a CTAS from Parquet to Parquet the types would be preserved; is that an incorrect assumption?
>>
>> Given this new piece of information, are there other steps I may want to try/attempt?
>>
>> Thanks Abdel for the idea!
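For concreteness, a minimal Drill SQL sketch of the per-file check described above. The option name, the CONVERT_FROM(field, 'UTF8') usage, and the row_created_ts column come from this thread; the paths, view name, and some_string_col are placeholders, not real names from the cluster:

    -- Use the alternative Parquet reader for this session only
    -- (store.parquet.use_new_reader is the option named in the thread).
    ALTER SESSION SET `store.parquet.use_new_reader` = true;

    -- Probe the single suspect file; this is the query that fails with
    -- ArrayIndexOutOfBounds under the default reader but works under the
    -- new one. The path under dfs is a placeholder.
    SELECT MIN(row_created_ts)
    FROM dfs.`/data/suspect/bad_file.parquet`;

    -- The view described above: map the VARBINARY string columns back to
    -- readable strings with CONVERT_FROM. some_string_col is a placeholder.
    CREATE OR REPLACE VIEW dfs.tmp.`typed_view` AS
    SELECT CONVERT_FROM(some_string_col, 'UTF8') AS some_string_col,
           row_created_ts
    FROM dfs.`/data/suspect/`;

ALTER SESSION keeps the reader switch scoped to the current connection, so everything else keeps the default reader; whether the full CTAS also succeeds under the new reader is the open memory question raised above.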
>> On Sat, May 28, 2016 at 8:50 AM, John Omernik <[email protected]> wrote:
>>
>>> Thanks Ted, I summarized the problem to the Parquet dev list. At this point, and I hate that I have restrictions on sharing the whole file, I am just looking for new ways to troubleshoot the problem. I know the MapR support team is scratching their heads on next steps as well. I did offer to them (and I offer it to others who may want to look into the problem) a screen share with me, even allowing control and in-depth troubleshooting. The cluster is not yet in production, so I can restart things, change debug settings, etc., and work with anyone who may be interested. I know it's not much to offer (a time-consuming phone call to help someone else with a problem), but I do offer it. Any other ideas would also be welcome.
>>>
>>> John
>>>
>>> On Sat, May 28, 2016 at 6:25 AM, Ted Dunning <[email protected]> wrote:
>>>
>>>> The Parquet user/dev mailing list might be helpful here. They have a real stake in making sure that all readers/writers can work together. The problem here really does sound like there is a borderline case that isn't handled as well in the Drill special-purpose Parquet reader as in the normal readers.
>>>>
>>>> On Fri, May 27, 2016 at 7:23 PM, John Omernik <[email protected]> wrote:
>>>>
>>>>> So working with MapR support we tried that with Impala, but it didn't produce the desired results, because the output file worked fine in Drill. Theory: the evil file is created in MapReduce and is using a different writer than Impala is using. Impala can read the evil file, but when it writes it uses its own writer, "fixing" the issue on the fly. Thus, Drill can't read the evil file, but if we try to reduce it with Impala, the file is no longer evil; consider it... chaotic neutral... (for all you D&D fans).
>>>>>
>>>>> I'd ideally love to extract out the badness, but I'm on the phone now with MapR support to figure out HOW; that is the question at hand.
>>>>>
>>>>> John
>>>>>
>>>>> On Fri, May 27, 2016 at 10:09 AM, Ted Dunning <[email protected]> wrote:
>>>>>
>>>>>> On Thu, May 26, 2016 at 8:50 PM, John Omernik <[email protected]> wrote:
>>>>>>
>>>>>>> So, if we have a known "bad" Parquet file (I use quotes because, remember, Impala queries this file just fine) created in MapReduce, with a BIGINT-typed column causing Array Index Out of Bounds problems, what would your next steps be to troubleshoot?
>>>>>>
>>>>>> I would start reducing the size of the evil file.
>>>>>>
>>>>>> If you have a tool that can query the bad Parquet file and write a new one (it sounds like Impala might do here), then selecting just the evil column is a good first step.
>>>>>>
>>>>>> After that, I would start bisecting to find a small range that still causes the problem. There may not be such a range, but it is a good thing to try.
>>>>>>
>>>>>> At that point, you could easily have the problem down to a few kilobytes of data that can be used in a unit test.
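As a rough sketch of the column extraction and bisection Ted describes, assuming a reader that can still open the file (Impala, or Drill with the new reader enabled as above). The table names, path, and row counts below are placeholders:

    -- Step 1: isolate just the suspect column into its own Parquet output.
    CREATE TABLE dfs.tmp.`evil_col_only` AS
    SELECT row_created_ts
    FROM dfs.`/data/suspect/bad_file.parquet`;

    -- Step 2: bisect by row range, halving the LIMIT/OFFSET window each pass,
    -- until the smallest slice that still fails under the default reader is found.
    CREATE TABLE dfs.tmp.`evil_col_slice_1` AS
    SELECT row_created_ts
    FROM dfs.tmp.`evil_col_only`
    LIMIT 500000 OFFSET 0;

The catch, as the Impala experiment above showed, is that any such rewrite goes through a different writer, so the extracted slice may come out readable even when the original file is not.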
