Doing more research, I found this: https://issues.apache.org/jira/browse/PARQUET-244
So the version that is being written is 1.5-cdh, thus the writer does have this bug. The question is: A. Could we reproduce this in Drill 1.6 to see if the default reader has the same error on known bad data? And B. Should Drill be able to handle reading data created with this bug? (Note: the Parquet project seems to have implemented handling for reading data created with this bug: https://github.com/apache/parquet-mr/pull/235. Note: I am not sure this is the same thing I am seeing; I am just trying to find the things in the Parquet project that seem close to what I am seeing.)

John

On Sat, May 28, 2016 at 9:28 AM, John Omernik <[email protected]> wrote:

New Update

Thank you Abdel for giving me an idea to try.

When I was first doing the CTAS, I tried setting store.parquet.use_new_reader = true. What occurred when I did that was Drill effectively "hung". I am not sure why; perhaps memory issues? (These are fairly beefy boxes: 24 GB of heap, 84 GB of direct memory.)

But now that I've gotten further in troubleshooting, I have "one" bad file, and so I tried the min(row_created_ts) on that one bad file. With store.parquet.use_new_reader set to false (the default) I get the Array Index Out of Bounds error, but when I set it to true, the query on that one file now works. So the "new" reader can handle the file; that's interesting. It still leaves me in a bit of a bind, because setting the new reader to true on the CTAS doesn't actually work (like I said, memory issues, etc.). Any ideas on the new reader, and how I could get memory consumption down and actually have that succeed? I am not doing any casting from the original Parquet files (would that help?). All I am doing in my view is the CONVERT_FROM(field, 'UTF8') on the string fields (because for some reason the Parquet string fields are read as binary in Drill). I was assuming (hoping?) that on a CTAS from Parquet to Parquet the types would be preserved; is that an incorrect assumption?

Given this new piece of information, are there other steps I may want to try?

Thanks Abdel for the idea!

On Sat, May 28, 2016 at 8:50 AM, John Omernik <[email protected]> wrote:

Thanks Ted, I summarized the problem to the Parquet dev list. At this point, and I hate that I have restrictions on sharing the whole file, I am just looking for new ways to troubleshoot the problem. I know the MapR support team is scratching their heads on next steps as well. I did offer to them (and I offer to others who may want to look into the problem) a screen share with me, even allowing control and in-depth troubleshooting. The cluster is not yet in production, so I can restart things, change debug settings, etc., and work with anyone who may be interested. (I know a time-consuming phone call to help with someone else's problem is not much to offer, but I do offer it.) Any other ideas would also be welcome.

John

On Sat, May 28, 2016 at 6:25 AM, Ted Dunning <[email protected]> wrote:

The Parquet user/dev mailing list might be helpful here. They have a real stake in making sure that all readers and writers can work together. The problem here really does sound like there is a borderline case that isn't handled as well in Drill's special-purpose Parquet reader as in the normal readers.
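For reference, the reader toggle, the single-file probe, and the CONVERT_FROM view described above would look roughly like the following in a Drill session. This is only a sketch: the dfs paths, the view and table names, and the column some_string_field are placeholders; only store.parquet.use_new_reader, row_created_ts, and CONVERT_FROM(field, 'UTF8') come from the thread itself.

    -- Toggle the experimental reader for the current session only
    ALTER SESSION SET `store.parquet.use_new_reader` = true;

    -- Probe the single suspect file (path is a placeholder)
    SELECT MIN(row_created_ts)
    FROM dfs.`/path/to/one_bad_file.parquet`;

    -- A view of the kind described: only string fields are wrapped in
    -- CONVERT_FROM, since Drill reads the Parquet string columns as binary
    CREATE OR REPLACE VIEW dfs.tmp.`source_view` AS
    SELECT CONVERT_FROM(some_string_field, 'UTF8') AS some_string_field,
           row_created_ts
    FROM dfs.`/path/to/source_parquet_dir`;

    -- The CTAS itself, which reportedly hangs when the new reader is enabled
    CREATE TABLE dfs.tmp.`rewritten_copy` AS
    SELECT * FROM dfs.tmp.`source_view`;

Whether the column types survive a Parquet-to-Parquet CTAS unchanged is exactly the open question in the thread, so the view above is illustrative rather than a confirmed fix.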
On Fri, May 27, 2016 at 7:23 PM, John Omernik <[email protected]> wrote:

So working with MapR support we tried that with Impala, but it didn't produce the desired results, because the output file worked fine in Drill. Theory: the evil file is created in MapReduce, which uses a different writer than Impala. Impala can read the evil file, but when it writes, it uses its own writer, "fixing" the issue on the fly. Thus, Drill can't read the evil file, but if we try to reduce it with Impala, the file is no longer evil; consider it... chaotic neutral... (for all you D&D fans).

I'd ideally love to extract just the badness, but I'm on the phone now with MapR support to figure out HOW; that is the question at hand.

John

On Fri, May 27, 2016 at 10:09 AM, Ted Dunning <[email protected]> wrote:

On Thu, May 26, 2016 at 8:50 PM, John Omernik <[email protected]> wrote:

> So, if we have a known "bad" Parquet file (I use quotes because, remember, Impala queries this file just fine) created in MapReduce, with a BIGINT-typed column causing Array Index Out of Bounds problems, what would your next steps be to troubleshoot?

I would start reducing the size of the evil file.

If you have a tool that can query the bad Parquet file and write a new one (it sounds like Impala might do here), then selecting just the evil column is a good first step.

After that, I would start bisecting to find a small range that still causes the problem. There may not be such a range, but it is a good thing to try.

At that point, you could easily have the problem down to a few kilobytes of data that can be used in a unit test.
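A rough sketch of the reduction Ted describes, written as Drill SQL with placeholder paths and table names, and assuming the new reader is the one able to read the file. Note the caveat John already ran into with Impala: rewriting the data through another writer may scrub the very defect you are trying to isolate, so each slice needs to be re-tested with the default reader.

    -- Pull out just the suspect column into its own Parquet output
    ALTER SESSION SET `store.parquet.use_new_reader` = true;

    CREATE TABLE dfs.tmp.`evil_column_only` AS
    SELECT row_created_ts
    FROM dfs.`/path/to/one_bad_file.parquet`;

    -- Bisect: write progressively smaller row ranges, then re-test each
    -- slice with the default reader to see whether it still reproduces
    CREATE TABLE dfs.tmp.`evil_slice_a` AS
    SELECT row_created_ts
    FROM dfs.tmp.`evil_column_only`
    LIMIT 500000;

    ALTER SESSION SET `store.parquet.use_new_reader` = false;
    SELECT MIN(row_created_ts) FROM dfs.tmp.`evil_slice_a`;

If a slice still fails, halve it again; if rewriting always cleans the file, the reduction would have to be done by splitting the original file's bytes or row groups instead of re-encoding them.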
