Thanks Ted, I summarized the problem to the Parquet dev list. At this
point, and I hate that I have restrictions on sharing the whole file, I
am just looking for new ways to troubleshoot the problem. I know the MapR
support team is scratching their heads on next steps as well. I did offer
to them (and I offer to others who may want to look into the problem) a
screen share with me, even allowing control and in-depth troubleshooting.
The cluster is not yet in production, so I can restart things, change debug
settings, etc., and work with anyone who may be interested. (I know it's
not much to offer, a time-consuming phone call to help someone else with a
problem, but I do offer it.) Any other ideas would also be welcome.

John


On Sat, May 28, 2016 at 6:25 AM, Ted Dunning <[email protected]> wrote:

> The Parquet user/dev mailing list might be helpful here. They have a real
> stake in making sure that all readers/writers can work together. The
> problem here really does sound like a borderline case that isn't handled
> as well in Drill's special-purpose Parquet reader as in the normal readers.
>
> On Fri, May 27, 2016 at 7:23 PM, John Omernik <[email protected]> wrote:
>
> > So, working with MapR support, we tried that with Impala, but it didn't
> > produce the desired results because the output file worked fine in Drill.
> > Theory: the evil file is created in MapReduce using a different writer
> > than Impala uses. Impala can read the evil file, but when it writes, it
> > uses its own writer, "fixing" the issue on the fly. Thus Drill can't read
> > the evil file, but if we try to reduce it with Impala, the file is no
> > longer evil; consider it... chaotic neutral... (for all you D&D fans)
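> >
> > For the record, the round-trip looked roughly like this (the table and
> > column names here are hypothetical placeholders, not our real schema):
> >
> >   -- Impala SQL sketch: read the evil file through Impala and rewrite it
> >   -- with Impala's own Parquet writer. The rewritten copy reads fine in
> >   -- Drill, which is what points the finger at the original writer.
> >   CREATE TABLE evil_copy STORED AS PARQUET
> >   AS SELECT * FROM evil_table;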
> >
> > I'd ideally love to extract just the badness, but I'm on the phone now
> > with MapR support to figure out HOW; that is the question at hand.
> >
> > John
> >
> > On Fri, May 27, 2016 at 10:09 AM, Ted Dunning <[email protected]>
> > wrote:
> >
> > > On Thu, May 26, 2016 at 8:50 PM, John Omernik <[email protected]>
> > > wrote:
> > >
> > > > So, if we have a known "bad" Parquet file (I use quotes because,
> > > > remember, Impala queries this file just fine) created in MapReduce,
> > > > with a BIGINT-typed column causing ArrayIndexOutOfBounds problems,
> > > > what would your next steps be to troubleshoot?
> > > >
> > >
> > > I would start reducing the size of the evil file.
> > >
> > > If you have a tool that can query the bad Parquet file and write a new
> > > one (it sounds like Impala might do here), then selecting just the evil
> > > column is a good first step.
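> > >
> > > A minimal Impala SQL sketch of that step (all names hypothetical):
> > >
> > >   -- Write only the suspect column into a fresh Parquet file, then
> > >   -- point Drill at the result to see whether it still blows up.
> > >   CREATE TABLE evil_column_only STORED AS PARQUET
> > >   AS SELECT suspect_bigint_col FROM evil_table;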
> > >
> > > After that, I would start bisecting to find a small row range that
> > > still causes the problem. There may not be one, but it is a good thing
> > > to try.
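> > >
> > > One way to take a bisection step in SQL (a sketch only: the row count
> > > and names are made up, and LIMIT without an ORDER BY returns an
> > > arbitrary, though usually stable, slice of a single file):
> > >
> > >   -- Keep roughly the first half of the rows. If this copy still trips
> > >   -- Drill, bisect it again; otherwise, try the other half.
> > >   CREATE TABLE evil_first_half STORED AS PARQUET
> > >   AS SELECT suspect_bigint_col FROM evil_column_only LIMIT 500000;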
> > >
> > > At that point, you could easily have the problem down to a few
> > > kilobytes of data that can be used in a unit test.
> > >
> >
>
