New Update: Thank you Abdel for giving me an idea to try.
When I was first doing the CTAS, I tried setting store.parquet.use_new_reader = true. What occurred when I did that was Drill effectively "hung"; I am not sure why, perhaps memory issues? (These are fairly beefy boxes: 24 GB of heap, 84 GB of direct memory.) But now that I've gotten further in troubleshooting, I have "one" bad file, and so I tried the min(row_created_ts) on that one bad file. With store.parquet.use_new_reader set to false (the default) I get the Array Index Out of Bounds, but when I set it to true, the query on that one file now works. So the "new" reader can handle the file; that's interesting.

It still leaves me in a bit of a bind, because setting the new reader to true on the CTAS doesn't actually work (like I said, memory issues, etc.). Any ideas on the new reader, and how I could get memory consumption down so that the CTAS actually succeeds?

I am not doing any casting from the original Parquet files (would that help?). All I am doing in my view is the CONVERT_FROM(field, 'UTF8') on the string fields (because for some reason the Parquet string fields are read as binary in Drill). I was assuming (hoping?) that on a CTAS from Parquet to Parquet the types would be preserved; is that an incorrect assumption?

Given this new piece of information, are there other steps I may want to try?

Thanks Abdel for the idea!
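In case it helps anyone reproduce this or spot something, here is roughly what I am running. Only store.parquet.use_new_reader and row_created_ts are the real names; the paths, view, and other column names below are placeholders, not my actual schema:

    -- Placeholder paths and columns; only store.parquet.use_new_reader
    -- and row_created_ts are the real names from this thread.

    -- Default reader: fails with Array Index Out of Bounds on the bad file.
    ALTER SESSION SET `store.parquet.use_new_reader` = false;
    SELECT MIN(row_created_ts) FROM dfs.`/data/bad_file.parquet`;

    -- "New" reader: the same query on the same file succeeds.
    ALTER SESSION SET `store.parquet.use_new_reader` = true;
    SELECT MIN(row_created_ts) FROM dfs.`/data/bad_file.parquet`;

    -- The view does nothing but wrap the string fields:
    CREATE OR REPLACE VIEW dfs.views.my_view AS
    SELECT CONVERT_FROM(some_string_field, 'UTF8') AS some_string_field,
           row_created_ts
    FROM dfs.`/data/source_parquet`;

    -- The CTAS that "hangs" when the new reader is enabled:
    CREATE TABLE dfs.tmp.rewritten AS SELECT * FROM dfs.views.my_view;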
On Sat, May 28, 2016 at 8:50 AM, John Omernik <j...@omernik.com> wrote:

> Thanks Ted, I summarized the problem for the Parquet dev list. At this
> point, and I hate that I have restrictions on sharing the whole file, I
> am just looking for new ways to troubleshoot the problem. I know the MapR
> support team is scratching their heads on next steps as well. I did offer
> to them (and I offer it to others who may want to look into the problem)
> a screen share with me, even allowing control and in-depth
> troubleshooting. The cluster is not yet in production, thus I can restart
> things, change debug settings, etc., and work with anyone who may be
> interested. (I know it's not much to offer, a time-consuming phone call
> to help someone else with a problem, but I do offer it.) Any other ideas
> would also be welcome.
>
> John
>
> On Sat, May 28, 2016 at 6:25 AM, Ted Dunning <ted.dunn...@gmail.com>
> wrote:
>
>> The Parquet user/dev mailing list might be helpful here. They have a
>> real stake in making sure that all readers/writers can work together.
>> The problem here really does sound like there is a borderline case that
>> isn't handled as well in the Drill special-purpose Parquet reader as in
>> the normal readers.
>>
>> On Fri, May 27, 2016 at 7:23 PM, John Omernik <j...@omernik.com> wrote:
>>
>> > So working with MapR support we tried that with Impala, but it didn't
>> > produce the desired results, because the output file worked fine in
>> > Drill. Theory: the evil file was created in MapReduce, which uses a
>> > different writer than Impala does. Impala can read the evil file, but
>> > when it writes it uses its own writer, "fixing" the issue on the fly.
>> > Thus, Drill can't read the evil file, but if we try to reduce it with
>> > Impala, the file is no longer evil; consider it... chaotic neutral...
>> > (for all you D&D fans).
>> >
>> > I'd ideally love to extract just the badness, but I'm on the phone now
>> > with MapR support to figure out HOW; that is the question at hand.
>> >
>> > John
>> >
>> > On Fri, May 27, 2016 at 10:09 AM, Ted Dunning <ted.dunn...@gmail.com>
>> > wrote:
>> >
>> > > On Thu, May 26, 2016 at 8:50 PM, John Omernik <j...@omernik.com>
>> > > wrote:
>> > >
>> > > > So, if we have a known "bad" Parquet file (I use quotes because,
>> > > > remember, Impala queries this file just fine) created in
>> > > > MapReduce, with a BIGINT-typed column causing Array Index Out of
>> > > > Bounds problems, what would your next steps be to troubleshoot?
>> > >
>> > > I would start by reducing the size of the evil file.
>> > >
>> > > If you have a tool that can query the bad Parquet file and write a
>> > > new one (it sounds like Impala might do here), then selecting just
>> > > the evil column is a good first step.
>> > >
>> > > After that, I would start bisecting to find a small range that
>> > > still causes the problem. There may not be such a range, but it is
>> > > a good thing to try.
>> > >
>> > > At that point, you could easily have the problem down to a few
>> > > kilobytes of data that can be used in a unit test.
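P.S. For anyone who wants to try Ted's reduction recipe, here is a rough sketch in Drill SQL, leaning on the fact that the new reader can at least open the file. The paths and row counts are made up, and the same caveat applies as with Impala: whatever engine rewrites the data may "fix" the file along the way.

    -- Hypothetical reduction sketch; paths and row counts are made up,
    -- and the rewriting step may "fix" the badness, as Impala's did.
    ALTER SESSION SET `store.parquet.use_new_reader` = true;

    -- Step 1: pull out just the evil column.
    CREATE TABLE dfs.tmp.evil_column AS
    SELECT row_created_ts FROM dfs.`/data/bad_file.parquet`;

    -- Step 2: bisect row ranges, halving the window each iteration,
    -- until the smallest slice that still fails is found.
    CREATE TABLE dfs.tmp.evil_slice AS
    SELECT row_created_ts FROM dfs.tmp.evil_column
    LIMIT 500000 OFFSET 0;

    -- Re-test each slice with the default reader:
    ALTER SESSION SET `store.parquet.use_new_reader` = false;
    SELECT MIN(row_created_ts) FROM dfs.tmp.evil_slice;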