On Thu, May 26, 2016 at 8:50 PM, John Omernik <[email protected]> wrote:
> So, if we have a known "bad" Parquet file (I use quotes, because remember,
> Impala queries this file just fine) created in Map Reduce, with a column
> causing Array Index Out of Bounds problems with a BIGINT typed column. What
> would your next steps be to troubleshoot?

I would start by reducing the size of the evil file. If you have a tool that can query the bad Parquet file and write a new one (it sounds like Impala might do here), then selecting just the evil column is a good first step. After that, I would start bisecting to find a small row range that still causes the problem. There may not be such a range, but it is a good thing to try. At that point, you could easily have the problem down to a few kilobytes of data that can be used in a unit test.
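The bisection step above can be sketched mechanically. This is a minimal illustration, not anything Drill- or Impala-specific: it assumes a hypothetical predicate `triggers_error(lo, hi)` that you would implement yourself, e.g. by rewriting rows [lo, hi) of the evil column to a new Parquet file and re-running the failing query against it.

```python
def bisect_failing_range(lo, hi, triggers_error):
    """Narrow the half-open row range [lo, hi) to a small sub-range
    that still reproduces the error.

    triggers_error(lo, hi) is a user-supplied predicate (hypothetical
    here) that returns True if querying only rows [lo, hi) of the
    suspect column still raises the exception.
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if triggers_error(lo, mid):
            # The lower half alone reproduces the error; keep it.
            hi = mid
        elif triggers_error(mid, hi):
            # The upper half alone reproduces the error; keep it.
            lo = mid
        else:
            # The error needs rows from both halves (e.g. a corrupt
            # value spanning a page boundary); stop shrinking here.
            break
    return lo, hi
```

Each query run costs one pass over a shrinking file, so even a multi-million-row file narrows to a handful of rows in a few dozen runs. If neither half reproduces the error on its own, the remaining range is still a much smaller repro than the original file.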
