The custom Drill reader might have a bug similar to PARQUET-244.
On Sun, May 29, 2016 at 1:12 PM, John Omernik <[email protected]> wrote:

> *sigh* PARQUET-244 is likely not my issue, considering that
> DeltaByteArrayWriter isn't in my stack trace. (I love learning CS101-type
> stuff in front of a whole community; it's great for self-esteem! :)
>
> On Sun, May 29, 2016 at 7:08 AM, John Omernik <[email protected]> wrote:
>
>> Doing more research, I found this:
>>
>> https://issues.apache.org/jira/browse/PARQUET-244
>>
>> So the version that is being written is 1.5-cdh, thus the writer does
>> have this bug. The question is: A. Could we reproduce this in Drill 1.6
>> to see if the default reader has the same error on known bad data? And
>> B. Should Drill be able to handle reading data created with this bug?
>> (Note: the Parquet project seems to have implemented handling for
>> reading data created with this bug:
>> https://github.com/apache/parquet-mr/pull/235. Note: I am not sure this
>> is the same thing I am seeing; I am just trying to find the things in
>> the Parquet project that seem close to what I am seeing.)
>>
>> John
>>
>> On Sat, May 28, 2016 at 9:28 AM, John Omernik <[email protected]> wrote:
>>
>>> New update:
>>>
>>> Thank you, Abdel, for giving me an idea to try.
>>>
>>> When I was first doing the CTAS, I tried setting
>>> store.parquet.use_new_reader = true. When I did that, Drill effectively
>>> "hung"; I am not sure why, perhaps memory issues? (These are fairly
>>> beefy bits: 24 GB of heap, 84 GB of direct memory.)
>>>
>>> But now that I've gotten further in troubleshooting, I have "one" bad
>>> file, so I tried the min(row_created_ts) on that one bad file. With
>>> store.parquet.use_new_reader set to false (the default) I get the Array
>>> Index Out of Bounds, but when I set it to true, the query on that one
>>> file now works. So the "new" reader can handle the file; that's
>>> interesting. It still leaves me in a bit of a bind, because setting the
>>> new reader to true on the CTAS doesn't actually work (like I said,
>>> memory issues, etc.). Any ideas on the new reader, and how I could get
>>> memory consumption down and actually have that succeed? I am not doing
>>> any casting from the original Parquet files (would that help?). All I
>>> am doing in my view is the CONVERT_FROM(field, 'UTF8') on the string
>>> fields (because for some reason the Parquet string fields are read as
>>> binary in Drill). I was assuming (hoping?) that on a CTAS from Parquet
>>> to Parquet the types would be preserved; is that an incorrect
>>> assumption?
>>>
>>> Given this new piece of information, are there other steps I may want
>>> to try?
>>>
>>> Thanks, Abdel, for the idea!
>>>
>>> On Sat, May 28, 2016 at 8:50 AM, John Omernik <[email protected]> wrote:
>>>
>>>> Thanks, Ted. I summarized the problem to the Parquet dev list. At this
>>>> point (and I hate that I have the restrictions on sharing the whole
>>>> file) I am just looking for new ways to troubleshoot the problem. I
>>>> know the MapR support team is scratching their heads on next steps as
>>>> well. I did offer them (and I offer it to others who may want to look
>>>> into the problem) a screen share with me, even allowing control and
>>>> in-depth troubleshooting. The cluster is not yet in production, so I
>>>> can restart things, change debug settings, etc., and work with anyone
>>>> who may be interested.
>>>> (I know it's not much to offer: a time-consuming phone call to help
>>>> someone else with a problem.) But I do offer it. Any other ideas would
>>>> also be welcome.
>>>>
>>>> John
>>>>
>>>> On Sat, May 28, 2016 at 6:25 AM, Ted Dunning <[email protected]> wrote:
>>>>
>>>>> The Parquet user/dev mailing list might be helpful here. They have a
>>>>> real stake in making sure that all readers/writers can work together.
>>>>> The problem here really does sound like there is a borderline case
>>>>> that isn't handled as well in the Drill special-purpose Parquet
>>>>> reader as in the normal readers.
>>>>>
>>>>> On Fri, May 27, 2016 at 7:23 PM, John Omernik <[email protected]> wrote:
>>>>>
>>>>>> So, working with MapR support, we tried that with Impala, but it
>>>>>> didn't produce the desired results, because the output file worked
>>>>>> fine in Drill. Theory: the evil file is created in MapReduce, using
>>>>>> a different writer than Impala uses. Impala can read the evil file,
>>>>>> but when it writes it uses its own writer, "fixing" the issue on the
>>>>>> fly. Thus, Drill can't read the evil file, but if we try to reduce
>>>>>> it with Impala, the file is no longer evil; consider it... chaotic
>>>>>> neutral... (for all you D&D fans).
>>>>>>
>>>>>> I'd ideally love to extract just the badness, but I'm on the phone
>>>>>> now with MapR support to figure out HOW; that is the question at
>>>>>> hand.
>>>>>>
>>>>>> John
>>>>>>
>>>>>> On Fri, May 27, 2016 at 10:09 AM, Ted Dunning <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> On Thu, May 26, 2016 at 8:50 PM, John Omernik <[email protected]> wrote:
>>>>>>>
>>>>>>>> So, if we have a known "bad" Parquet file (I use quotes because,
>>>>>>>> remember, Impala queries this file just fine) created in
>>>>>>>> MapReduce, with a BIGINT-typed column causing Array Index Out of
>>>>>>>> Bounds problems, what would your next steps be to troubleshoot?
>>>>>>>
>>>>>>> I would start reducing the size of the evil file.
>>>>>>>
>>>>>>> If you have a tool that can query the bad Parquet file and write a
>>>>>>> new one (sounds like Impala might do here), then selecting just the
>>>>>>> evil column is a good first step.
>>>>>>>
>>>>>>> After that, I would start bisecting to find a small range that
>>>>>>> still causes the problem. There may not be such a range, but it is
>>>>>>> a good thing to try.
>>>>>>>
>>>>>>> At that point, you could easily have the problem down to a few
>>>>>>> kilobytes of data that can be used in a unit test.
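
For reference, the single-file comparison John describes (default reader failing, alternative reader succeeding) boils down to a sketch like the one below. The file path is a placeholder; only store.parquet.use_new_reader, min(), and row_created_ts come from the thread.

  -- Rough sketch of the single-file test; /path/to/bad_file.parquet is a
  -- hypothetical location under the dfs storage plugin.

  -- Default (fast) Parquet reader: the case that throws
  -- ArrayIndexOutOfBoundsException in the thread.
  ALTER SESSION SET `store.parquet.use_new_reader` = false;
  SELECT MIN(row_created_ts) FROM dfs.`/path/to/bad_file.parquet`;

  -- Same query with the alternative reader, which handles the file.
  ALTER SESSION SET `store.parquet.use_new_reader` = true;
  SELECT MIN(row_created_ts) FROM dfs.`/path/to/bad_file.parquet`;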
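
The view-plus-CTAS setup John mentions would look roughly like the following. Every table, view, and column name here is hypothetical; CONVERT_FROM(field, 'UTF8') is the conversion he describes applying to the string fields that Drill reads back from Parquet as binary.

  -- Hypothetical names throughout; only the CONVERT_FROM(..., 'UTF8')
  -- pattern and the Parquet-to-Parquet CTAS come from the thread.
  CREATE OR REPLACE VIEW dfs.tmp.source_view AS
  SELECT
    CONVERT_FROM(some_string_field, 'UTF8') AS some_string_field,  -- Parquet strings arrive as binary
    row_created_ts                                                  -- the BIGINT column discussed above
  FROM dfs.`/path/to/source_parquet/`;

  -- CTAS from the view; with store.format left at its parquet default,
  -- this writes new Parquet files.
  CREATE TABLE dfs.tmp.rewritten_copy AS
  SELECT * FROM dfs.tmp.source_view;

Whether the BIGINT column survives a Parquet-to-Parquet CTAS unchanged is exactly the open question in the thread; the sketch only shows the shape of the statements.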
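
Ted's reduce-and-bisect suggestion could in principle be attempted from Drill itself against the original file, using the reader that fails. This is only a sketch with a hypothetical path, and it rests on the assumption, not guaranteed, that a LIMIT/OFFSET window keeps Drill from decoding the whole column, so that shrinking or shifting the window narrows down which rows hit the bad pages.

  -- Bisection sketch: keep the default reader (the one that fails) and
  -- probe ever-smaller row windows of the suspect column.
  ALTER SESSION SET `store.parquet.use_new_reader` = false;

  -- First half of the file; if this errors, halve again (LIMIT 250000),
  -- otherwise move the window (OFFSET 500000), and so on.
  SELECT row_created_ts
  FROM dfs.`/path/to/bad_file.parquet`
  LIMIT 500000 OFFSET 0;

As the thread notes, rewriting the reduced range through another writer may "fix" the file in the process, so the narrowed range ultimately still has to be taken from the original file for a unit test.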
