Ya, but based on my testing, if it's accurate to say that when store.parquet.use_new_reader is false Drill is NOT using the custom reader, then the bug is not occurring in the custom reader. When I do the min/max with store.parquet.use_new_reader set to true, it actually returns the values properly. My issue with store.parquet.use_new_reader = true is that when I do a CTAS I get heap space issues (even at 24 GB) and nodes seem to restart or something. (It's a very weird state the bits get into: everything hangs, some things work, other things seem to be in an in-between state between working and not working, etc. For example, DESCRIBE operations on tables eventually return, but only after 10+ seconds. I resolve this by restarting all the bits, and then things are right as rain.)
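For reference, here is roughly what I'm running when I flip the option; the path is a placeholder and the column name is just the one from my earlier mail, so treat this as a sketch of the pattern rather than my exact statements:

    -- new reader: min/max on the one bad file returns proper values
    ALTER SESSION SET `store.parquet.use_new_reader` = true;
    SELECT MIN(row_created_ts), MAX(row_created_ts)
    FROM dfs.`/path/to/the_one_bad_file.parquet`;

    -- back to the default reader: the same query hits the Array Index Out of Bounds
    ALTER SESSION SET `store.parquet.use_new_reader` = false;
    SELECT MIN(row_created_ts), MAX(row_created_ts)
    FROM dfs.`/path/to/the_one_bad_file.parquet`;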
That goes to my other question: is there a way to optimize my CTAS statement with the new reader set to true? I'd be OK with "an" option to do my CTAS, even if I have to set the new reader and some memory options; I've sketched below, under my quoted mail from yesterday, roughly what that CTAS looks like. (I am sure folks in the Drill project will want to know what the bug is in the normal reader, and I would still help with that regardless of whether I have an option for my CTAS in my use case.)

On Sun, May 29, 2016 at 7:18 AM, Ted Dunning <[email protected]> wrote:

> PARQUET-244 might have a similar bug in the custom Drill reader.
>
> On Sun, May 29, 2016 at 1:12 PM, John Omernik <[email protected]> wrote:
>
> > *sigh* PARQUET-244 is likely not my issue, considering that DeltaByteArrayWriter isn't in my stack trace. (I love learning CS101-type stuff in front of a whole community, it's great for self-esteem! :)
> >
> > On Sun, May 29, 2016 at 7:08 AM, John Omernik <[email protected]> wrote:
> >
> > > Doing more research, I found this:
> > >
> > > https://issues.apache.org/jira/browse/PARQUET-244
> > >
> > > So the version that is being written is 1.5-cdh, thus the writer does have this bug. The question is: A. Could we reproduce this in Drill 1.6 to see if the default reader has the same error on known bad data? And B. Should Drill be able to handle reading data created with this bug? (Note: the Parquet project seems to have implemented handling for reading data created with this bug: https://github.com/apache/parquet-mr/pull/235. Also note: I am not sure this is the same thing I am seeing; I am just trying to find the things in the Parquet project that seem close to what I am seeing.)
> > >
> > > John
> > >
> > > On Sat, May 28, 2016 at 9:28 AM, John Omernik <[email protected]> wrote:
> > >
> > > > New Update
> > > >
> > > > Thank you Abdel for giving me an idea to try.
> > > >
> > > > When I was first doing the CTAS, I tried setting store.parquet.use_new_reader = true. What occurred when I did that was that Drill effectively "hung". I am not sure why, perhaps memory issues? (These are fairly beefy bits: 24 GB of heap, 84 GB of direct memory.)
> > > >
> > > > But now that I've gotten further in troubleshooting, I have "one" bad file, so I tried the min(row_created_ts) on that one bad file. With store.parquet.use_new_reader set to false (the default) I get the Array Index Out of Bounds, but when I set it to true, the query on that one file now works. So the "new" reader can handle the file; that's interesting. It still leaves me in a bit of a bind, because setting the new reader to true on the CTAS doesn't actually work (like I said, memory issues, etc.). Any ideas on the new reader, and how I could get memory consumption down and actually have that succeed? I am not doing any casting from the original Parquet files (would that help?). All I am doing in my view is the CONVERT_FROM(field, 'UTF8') on the string fields (because for some reason the Parquet string fields are read as binary in Drill). I was assuming (hoping?) that on a CTAS from Parquet to Parquet the types would be preserved; is that an incorrect assumption?
> > > >
> > > > Given this new piece of information, are there other steps I may want to try/attempt?
> > > >
> > > > Thanks Abdel for the idea!
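To make that CTAS question concrete, the statement I keep referring to is roughly the shape below. This is only a sketch: the table, view, and column names are placeholders rather than my real schema, and planner.width.max_per_node is just the knob I'm assuming would reduce how many Parquet reader/writer buffers are alive at once on each node, not something I know fixes the heap problem.

    ALTER SESSION SET `store.parquet.use_new_reader` = true;
    -- assumption: fewer parallel fragments per node = fewer reader/writer buffers at once
    ALTER SESSION SET `planner.width.max_per_node` = 2;

    -- the view just wraps the source Parquet and decodes the binary-read string columns
    CREATE OR REPLACE VIEW dfs.tmp.`my_view` AS
    SELECT CONVERT_FROM(some_string_field, 'UTF8') AS some_string_field,
           row_created_ts
    FROM dfs.`/path/to/source_parquet`;

    CREATE TABLE dfs.tmp.`my_copy` AS
    SELECT * FROM dfs.tmp.`my_view`;

If there are better memory-related options to set for the writer side of that, that's exactly what I'm hoping someone can point out.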
> > > > On Sat, May 28, 2016 at 8:50 AM, John Omernik <[email protected]> wrote:
> > > >
> > > > > Thanks Ted, I summarized the problem to the Parquet dev list. At this point (and I hate that I have the restrictions on sharing the whole file) I am just looking for new ways to troubleshoot the problem. I know the MapR support team is scratching their heads on next steps as well. I did offer to them (and I offer to others who may want to look into the problem) a screen share with me, even allowing control and in-depth troubleshooting. The cluster is not yet in production, thus I can restart things, change debug settings, etc., and work with anyone who may be interested. (I know it's not much to offer, a time-consuming phone call to help someone else on a problem, but I do offer it.) Any other ideas would also be welcome.
> > > > >
> > > > > John
> > > > >
> > > > > On Sat, May 28, 2016 at 6:25 AM, Ted Dunning <[email protected]> wrote:
> > > > >
> > > > > > The Parquet user/dev mailing list might be helpful here. They have a real stake in making sure that all readers/writers can work together. The problem here really does sound like there is a borderline case that isn't handled as well in the Drill special-purpose parquet reader as in the normal readers.
> > > > > >
> > > > > > On Fri, May 27, 2016 at 7:23 PM, John Omernik <[email protected]> wrote:
> > > > > >
> > > > > > > So working with MapR support we tried that with Impala, but it didn't produce the desired results because the output file worked fine in Drill. Theory: the evil file is created in MapReduce, and is using a different writer than Impala is using. Impala can read the evil file, but when it writes it uses its own writer, "fixing" the issue on the fly. Thus, Drill can't read the evil file, but if we try to reduce it with Impala, the file is no longer evil; consider it... chaotic neutral... (for all you D&D fans).
> > > > > > >
> > > > > > > I'd ideally love to extract it down to the badness, but I'm on the phone now with MapR support to figure out HOW; that is the question at hand.
> > > > > > >
> > > > > > > John
> > > > > > >
> > > > > > > On Fri, May 27, 2016 at 10:09 AM, Ted Dunning <[email protected]> wrote:
> > > > > > >
> > > > > > > > On Thu, May 26, 2016 at 8:50 PM, John Omernik <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > So, if we have a known "bad" Parquet file (I use quotes because, remember, Impala queries this file just fine) created in MapReduce, with a column causing Array Index Out of Bounds problems with a BIGINT typed column, what would your next steps be to troubleshoot?
> > > > > > > >
> > > > > > > > I would start reducing the size of the evil file.
> > > > > > > >
> > > > > > > > If you have a tool that can query the bad parquet and write a new one (sounds like Impala might do here), then selecting just the evil column is a good first step.
> > > > > > > >
> > > > > > > > After that, I would start bisecting to find a small range that still causes the problem.
> > > > > > > > There may not be such a range, but it is a good thing to try.
> > > > > > > >
> > > > > > > > At that point, you could easily have the problem down to a few kilobytes of data that can be used in a unit test.
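Following up on Ted's suggestion quoted above, this is roughly how I'd try to do the reduction with Drill itself, given that the new reader can at least read the one bad file. It's only a sketch: the paths and the row count are placeholders, I'm assuming scan order over a single file is repeatable enough for a rough bisection, and there is the obvious caveat from the Impala experiment earlier in the thread that rewriting the data may well "fix" it, in which case the reduction would have to happen on the original file with a lower-level Parquet tool instead.

    -- read with the reader that can handle the file
    ALTER SESSION SET `store.parquet.use_new_reader` = true;

    -- step 1: pull out just the suspect column
    CREATE TABLE dfs.tmp.`evil_column_only` AS
    SELECT row_created_ts
    FROM dfs.`/path/to/the_one_bad_file.parquet`;

    -- step 2: take progressively smaller slices (halving the LIMIT each time)
    CREATE TABLE dfs.tmp.`evil_slice_1` AS
    SELECT row_created_ts
    FROM dfs.`/path/to/the_one_bad_file.parquet`
    LIMIT 500000;

    -- step 3: re-test each slice with the default reader to see if it still blows up
    ALTER SESSION SET `store.parquet.use_new_reader` = false;
    SELECT MIN(row_created_ts) FROM dfs.tmp.`evil_slice_1`;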
