So, working with MapR support, we tried that with Impala, but it didn't produce the desired result: the output file worked fine in Drill. Theory: the evil file is created in MapReduce using a different Parquet writer than Impala uses. Impala can read the evil file, but when it writes, it uses its own writer, "fixing" the issue on the fly. Thus Drill can't read the evil file, but if we try to reduce it with Impala, the file is no longer evil. Consider it... chaotic neutral... (for all you D&D fans).
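For reference, the round trip we tried looks roughly like the Impala SQL below (table names and paths are placeholders, not our real ones; the schema is inferred from the Parquet file itself):

    -- Point Impala at the evil Parquet data. Path and table
    -- names here are illustrative only.
    CREATE EXTERNAL TABLE evil_raw
      LIKE PARQUET '/data/evil/part-00000.parquet'
      STORED AS PARQUET
      LOCATION '/data/evil/';

    -- CTAS rewrites the data with Impala's own Parquet writer,
    -- which seems to be what "fixes" the file on the way out.
    CREATE TABLE evil_rewritten
      STORED AS PARQUET
      AS SELECT * FROM evil_raw;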
I'd ideally love to extract it into badness, but I'm on the phone now with MapR support to figure out HOW; that is the question at hand.

John

On Fri, May 27, 2016 at 10:09 AM, Ted Dunning <[email protected]> wrote:

> On Thu, May 26, 2016 at 8:50 PM, John Omernik <[email protected]> wrote:
>
> > So, if we have a known "bad" Parquet file (I use quotes because,
> > remember, Impala queries this file just fine) created in MapReduce,
> > with a BIGINT typed column causing Array Index Out of Bounds problems,
> > what would your next steps be to troubleshoot?
>
> I would start reducing the size of the evil file.
>
> If you have a tool that can query the bad Parquet file and write a new
> one (it sounds like Impala might do here), then selecting just the evil
> column is a good first step.
>
> After that, I would start bisecting to find a small range that still
> causes the problem. There may not be such a range, but it is a good
> thing to try.
>
> At that point, you could easily have the problem down to a few kilobytes
> of data that can be used in a unit test.
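A rough sketch of the reduction Ted describes, again in Impala SQL (column and key names are placeholders; Impala doesn't guarantee row order without an ORDER BY, so the bisection assumes some sortable key column to split on, and per the above there's the caveat that Impala's writer may "fix" the file in the process):

    -- Step 1: pull just the suspect BIGINT column into its own
    -- Parquet file (placeholder names throughout).
    CREATE TABLE evil_col_only
      STORED AS PARQUET
      AS SELECT suspect_bigint_col FROM evil_raw;

    -- Step 2: bisect on an assumed sortable key column `id`,
    -- halving the range and re-testing the output in Drill each
    -- round until the failure is pinned to a small slice.
    CREATE TABLE evil_slice
      STORED AS PARQUET
      AS SELECT suspect_bigint_col
         FROM evil_raw
         WHERE id >= 0 AND id < 500000;  -- narrow this range each round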
