On Thu, May 26, 2016 at 8:50 PM, John Omernik <[email protected]> wrote:

> So, if we have a known "bad" Parquet file (I use quotes because, remember,
> Impala queries this file just fine) created in MapReduce, with a
> BIGINT-typed column causing Array Index Out of Bounds errors, what would
> your next steps be to troubleshoot?
>

I would start reducing the size of the evil file.

If you have a tool that can query the bad Parquet file and write a new one
(it sounds like Impala might work here), then selecting just the evil column
is a good first step.

After that, I would start bisecting the rows to find a small range that still
causes the problem (a rough sketch of that loop is below). There may not be
such a range, but it is a good thing to try.
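Something like this, assuming the failure is triggered by one localized row
range and that still_fails() is a placeholder you fill in, e.g. by pointing
the failing MapReduce read at the candidate file:

    import pyarrow.parquet as pq

    def still_fails(path):
        # Placeholder: run the reader that throws ArrayIndexOutOfBounds
        # against `path` and return True if it still blows up.
        raise NotImplementedError

    def bisect_rows(src="evil_column_only.parquet", dst="candidate.parquet"):
        table = pq.read_table(src)
        lo, hi = 0, table.num_rows          # current window [lo, hi)
        while hi - lo > 1:
            mid = (lo + hi) // 2
            # Write the first half of the window and test it.
            pq.write_table(table.slice(lo, mid - lo), dst)
            if still_fails(dst):
                hi = mid                    # first half reproduces the failure
            else:
                lo = mid                    # assume the second half does
        pq.write_table(table.slice(lo, hi - lo), dst)
        return lo, hi

One caveat with any rewrite-based approach: rewriting changes row group and
page boundaries, so a layout-dependent bug may stop reproducing once the
data is resliced, which is why such a range may not exist.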

At that point, you could easily have the problem down to a few kilobytes of
data that can be used in a unit test.
