New Update

Thank you, Abdel, for giving me an idea to try.

When I was first doing the CTAS, I tried setting store.parquet.use_new_reader
= true. When I did that, Drill effectively "hung"; I am not sure why,
perhaps memory issues? (These are fairly beefy nodes: 24 GB of heap, 84 GB
of direct memory.)
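
For reference, the attempt was shaped roughly like this (the paths and
table names here are placeholders; the real CTAS selects from my view):

  -- enable the "new" Parquet reader for this session
  ALTER SESSION SET `store.parquet.use_new_reader` = true;

  -- CTAS from the source Parquet data into a new Parquet table
  CREATE TABLE dfs.tmp.`ctas_copy` AS
  SELECT * FROM dfs.`/data/source_parquet`;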

But now that I've gotten further in troubleshooting, I have narrowed it
down to "one" bad file, so I tried the min(row_created_ts) on just that
file. With store.parquet.use_new_reader set to false (the default) I get
the Array Index Out of Bounds error, but when I set it to true, the query
on that one file now works. So the "new" reader can handle the file;
that's interesting.
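
Concretely, the toggle that changes the behavior on the single bad file
(the file name below is a placeholder):

  -- default reader: fails with the Array Index Out of Bounds error
  ALTER SESSION SET `store.parquet.use_new_reader` = false;
  SELECT min(row_created_ts) FROM dfs.`/data/source_parquet/bad_file.parquet`;

  -- "new" reader: the same query on the same file succeeds
  ALTER SESSION SET `store.parquet.use_new_reader` = true;
  SELECT min(row_created_ts) FROM dfs.`/data/source_parquet/bad_file.parquet`;
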
It still leaves me in a bit of a bind, because setting the new reader to
true on the CTAS doesn't actually work (like I said, memory issues, etc.).
Any ideas on the new reader, and how I could get memory consumption down so
that the CTAS actually succeeds? I am not doing any casting from the
original Parquet files (would that help?). All I am doing in my view is
CONVERT_FROM(field, 'UTF8') on the string fields (because for some reason
the Parquet string fields are read as binary in Drill). I was assuming
(hoping?) that on a CTAS from Parquet to Parquet the types would be
preserved; is that an incorrect assumption?
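
For what it's worth, the view is basically just this shape (the field and
path names are made up; the real view has many more columns):

  CREATE OR REPLACE VIEW dfs.tmp.`source_view` AS
  SELECT
    CONVERT_FROM(some_string_field, 'UTF8') AS some_string_field,
    row_created_ts
  FROM dfs.`/data/source_parquet`;

No CASTs anywhere, just CONVERT_FROM on the fields that come back as binary.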

Given this new piece of information, are there other steps I may want to
try?

Thanks Abdel for the idea!



On Sat, May 28, 2016 at 8:50 AM, John Omernik <j...@omernik.com> wrote:

> Thanks Ted, I summarized the problem to the Parquet dev list. At this
> point (and I hate that I have restrictions on sharing the whole file), I
> am just looking for new ways to troubleshoot the problem. I know the MapR
> support team is scratching their heads on next steps as well. I did offer
> to them (and I offer to others who may want to look into the problem) a
> screen share with me, even allowing control and in-depth troubleshooting.
> The cluster is not yet in production, thus I can restart things, change
> debug settings, etc., and work with anyone who may be interested. (I know
> it's not much to offer: a time-consuming phone call to help someone else
> with a problem), but I do offer it. Any other ideas would also be welcome.
>
> John
>
>
> On Sat, May 28, 2016 at 6:25 AM, Ted Dunning <ted.dunn...@gmail.com>
> wrote:
>
>> The Parquet user/dev mailing list might be helpful here. They have a real
>> stake in making sure that all readers/writers can work together. The
>> problem here really does sound like there is a borderline case that isn't
>> handled as well in the Drill special-purpose Parquet reader as in the
>> normal readers.
>>
>>
>>
>>
>>
>> On Fri, May 27, 2016 at 7:23 PM, John Omernik <j...@omernik.com> wrote:
>>
>> > So, working with MapR support, we tried that with Impala, but it didn't
>> > produce the desired results because the output file worked fine in
>> > Drill. Theory: the evil file is created in MapReduce, which uses a
>> > different writer than Impala does. Impala can read the evil file, but
>> > when it writes it uses its own writer, "fixing" the issue on the fly.
>> > Thus, Drill can't read the evil file, but if we try to reduce it with
>> > Impala, the file is no longer evil; consider it... chaotic neutral...
>> > (for all you D&D fans).
>> >
>> > I'd ideally love to extract just the badness, but I'm on the phone now
>> > with MapR support to figure out HOW; that is the question at hand.
>> >
>> > John
>> >
>> > On Fri, May 27, 2016 at 10:09 AM, Ted Dunning <ted.dunn...@gmail.com>
>> > wrote:
>> >
>> > > On Thu, May 26, 2016 at 8:50 PM, John Omernik <j...@omernik.com>
>> wrote:
>> > >
>> > > > So, if we have a known "bad" Parquet file (I use quotes because,
>> > > > remember, Impala queries this file just fine) created in MapReduce,
>> > > > with a BIGINT-typed column causing Array Index Out of Bounds
>> > > > problems, what would your next steps be to troubleshoot?
>> > > >
>> > >
>> > > I would start reducing the size of the evil file.
>> > >
>> > > If you have a tool that can query the bad parquet and write a new one
>> > > (sounds like Impala might do here) then selecting just the evil column
>> > is a
>> > > good first step.
>> > >
>> > > After that, I would start bisecting to find a small range that still
>> > > causes the problem. There may not be such a range, but it is a good
>> > > thing to try.
>> > >
>> > > At that point, you could easily have the problem down to a few
>> kilobytes
>> > of
>> > > data that can be used in a unit test.
>> > >
>> >
>>
>
>
