I remember reading that Drill uses two Parquet readers: one for certain cases (flat structures, I think) and the other for complex structures. A. Am I remembering correctly? B. If so, can I determine, via the query plan or some other way, which one is being used? And C. Can I force Drill to try the other reader?
On Saturday, May 28, 2016, Ted Dunning <[email protected]> wrote:

> The Parquet user/dev mailing list might be helpful here. They have a real
> stake in making sure that all readers/writers can work together. The
> problem here really does sound like there is a borderline case that isn't
> handled as well in the Drill special purpose parquet reader as in the
> normal readers.
>
> On Fri, May 27, 2016 at 7:23 PM, John Omernik <[email protected]> wrote:
>
> > So working with MapR support we tried that with Impala, but it didn't
> > produce the desired results because the outputfile worked fine in Drill.
> > Theory: Evil file is created in Mapr Reduce, and is using a different
> > writer than Impala is using. Impala can read the evil file, but when it
> > writes it uses it's own writer, "fixing" the issue on the fly. Thus, Drill
> > can't read evil file, but if we try to reduce with Impala, files is no
> > longer evil, consider it... chaotic neutral ... (For all you D&D fans )
> >
> > I'd ideally love to extract into badness, but on the phone now with MapR
> > support to figure out HOW, that is the question at hand.
> >
> > John
> >
> > On Fri, May 27, 2016 at 10:09 AM, Ted Dunning <[email protected]> wrote:
> >
> > > On Thu, May 26, 2016 at 8:50 PM, John Omernik <[email protected]> wrote:
> > >
> > > > So, if we have a known "bad" Parquet file (I use quotes, because
> > > > remember, Impala queries this file just fine) created in Map Reduce,
> > > > with a column causing Array Index Out of Bounds problems with a
> > > > BIGINT typed column. What would your next steps be to troubleshoot?
> > >
> > > I would start reducing the size of the evil file.
> > >
> > > If you have a tool that can query the bad parquet and write a new one
> > > (sounds like Impala might do here) then selecting just the evil column
> > > is a good first step.
> > >
> > > After that, I would start bisecting to find a small range that still
> > > causes the problem. There may not be such, but it is good thing to try.
> > >
> > > At that point, you could easily have the problem down to a few
> > > kilobytes of data that can be used in a unit test.

-- 
Sent from my iThing
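P.S. For reference, the bisecting step Ted describes in the quoted thread could be sketched roughly as below. This is only a sketch under assumptions, not something anyone on the thread has run: `shrink_failing_range` and the `fails` callback are hypothetical names, and `fails(lo, hi)` stands in for whatever we end up using to write rows [lo, hi) of the evil column to a new file with the same writer and check whether Drill still throws the Array Index Out of Bounds error.

```python
from typing import Callable, Tuple


def shrink_failing_range(
    lo: int, hi: int, fails: Callable[[int, int], bool]
) -> Tuple[int, int]:
    """Shrink the half-open row range [lo, hi) while it still reproduces the failure.

    `fails(lo, hi)` is a hypothetical callback: it should return True when a
    file holding only rows [lo, hi) still triggers the error in the reader.
    """
    if not fails(lo, hi):
        raise ValueError("the starting range does not reproduce the failure")
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fails(lo, mid):
            hi = mid  # the first half alone still fails, keep it
        elif fails(mid, hi):
            lo = mid  # the second half alone still fails, keep it
        else:
            # Neither half fails on its own, so the problem needs rows from
            # both halves; stop here with the smallest range found so far.
            break
    return lo, hi
```

If the failure comes from a single bad row, this ends on a one-row range. If no small range reproduces it (the "there may not be such" case), it simply stops early and returns the smallest failing range it found.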
