The MapR support folks gave me a good idea to troubleshoot: can you home in on which columns are the problem? I have nearly 100 fields in this table, and the hunch was that only a few of them might be at issue. I took this idea and wrote a Python script that took the field list and, using the REST API, ran a CTAS over a known bad day of data. When it failed, I recorded that, along with the file that was failing. (For some reason I couldn't get a CTAS against specific files to fail, only when the files were queried all together.) On each iteration I took the last field off the list and tried the CTAS again (rough sketch of the script below). Eventually I found the field: a BIGINT field that we will call bad_field. Now, what if I did a select min(bad_field), max(bad_field) from `path/to/knownbad`? Boom, that fails as well, with the same array index out of bounds error. Cool. What if I did the CTAS without that field? Boom, that worked. (We need a JIRA filed to get me to stop saying boom.)
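In case it helps anyone else, here is a rough, simplified sketch of that elimination script. The drillbit address, field names, path, and temp table name are all placeholders rather than the real ones, and exactly how a failed query is reported can vary by Drill version, so treat the error check as approximate:

import requests

DRILL_URL = "http://localhost:8047/query.json"    # placeholder drillbit address
FIELDS = ["field_001", "field_002", "bad_field"]  # placeholder: the ~100 field names, in order
BAD_DAY = "dfs.`/path/to/knownbad`"               # placeholder: the known bad day of data

def run_sql(sql):
    # Submit a query over the Drill REST API and return (ok, json_body).
    resp = requests.post(DRILL_URL, json={"queryType": "SQL", "query": sql})
    body = resp.json()
    # Depending on the Drill version, a failed query shows up as a non-200
    # status and/or an "errorMessage" key in the response body.
    return resp.status_code == 200 and "errorMessage" not in body, body

remaining = list(FIELDS)
last_dropped = None
while remaining:
    cols = ", ".join("`%s`" % f for f in remaining)
    ok, body = run_sql("CREATE TABLE dfs.tmp.`ctas_probe` AS SELECT %s FROM %s"
                       % (cols, BAD_DAY))
    run_sql("DROP TABLE dfs.tmp.`ctas_probe`")  # clean up any (partial) output; result ignored
    if ok:
        print("CTAS succeeded with %d columns; suspect column: %s"
              % (len(remaining), last_dropped))
        break
    # The error text includes the Hadoop path of the failing file, so the real
    # script also recorded body.get("errorMessage") to track which file broke.
    print("CTAS failed with %d columns" % len(remaining))
    last_dropped = remaining.pop()  # take the last field off and try again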
Ok, I think I am on to something here. Next step: could I make the min/max
query fail when querying ONLY that file? Yes! Ok, we are getting close. This
is great, because now instead of 120 GB of data, I can look at 240 MB of
data. However, the same min/max in Impala works fine, and I am unsure what
to look at next. I will be doing a WebEx with MapR Support tomorrow, but I
thought I'd multi-thread this too, mainly because if someone else is having
a similar problem, I want to keep what I am doing to solve it out in the
open. (A recap of the exact queries is at the bottom of this mail, below the
quoted messages.)

So: we have a known "bad" Parquet file (quotes because, remember, Impala
queries this file just fine), created in MapReduce, with a BIGINT column
causing Array Index Out of Bounds problems in Drill. What would your next
steps be to troubleshoot?

On Mon, May 23, 2016 at 4:16 PM, John Omernik <[email protected]> wrote:

> Troubleshooting this is made more difficult by the fact that the file that
> gives the error works fine when I select directly from it into a new
> table... this makes it very tricky to troubleshoot; any assistance on this
> would be appreciated. I've opened a ticket with MapR as well, but I am
> stumped, and this is our primary use case right now, thus this is a
> blocker. (Note I've tried three different days: two fail, one works.)
>
> John
>
> On Mon, May 23, 2016 at 9:48 AM, John Omernik <[email protected]> wrote:
>
>> I have a largish directory of Parquet files generated for use in Impala.
>> They were created with the CDH version of apache-parquet-mr (not sure of
>> the version at this time).
>>
>> Some settings:
>> Compression: snappy
>> Use Dictionary: true
>> WRITER_VERSION: PARQUET_1_0
>>
>> I can read them as is in Drill; however, the strings all come through as
>> binary (see other thread). I can CAST all those fields as VARCHAR and
>> read them, but I take a bad performance hit (2 seconds to read directly
>> from raw parquet, limit 10, but showing binary; 25 seconds to use a view
>> that CASTs all fields into the proper types... data returns accurately,
>> but 10 rows taking 25 seconds is too long).
>>
>> So I want to read from this directory (approx 126 GB) and CTAS in a way
>> Drill will be happier with.
>>
>> I've tried this two ways. One was just to CTAS directly from the view I
>> created, all else being default. The other was to set the reader
>> "new_reader" = true. Neither worked, and new_reader actually behaves very
>> badly (need to restart drillbits). At least the other, default reader
>> errors :)
>>
>> store.parquet.use_new_reader = false (the default)
>> This threw the error below (it's a truncated error, with lots of field
>> names and other things). It stored 6 GB of files and died.
>>
>> store.parquet.use_new_reader = true
>> 1.4 GB of files created and everything hangs, need to restart drillbits
>> (is this an issue?)
>>
>> Error from the "non" new_reader:
>>
>> Error: SYSTEM ERROR: ArrayIndexOutOfBoundsException: 107014
>>
>> Fragment 1:36
>>
>> [Error Id: ab5b202f-94cc-4275-b136-537dfbea6b31 on
>> atl1ctuzeta05.ctu-bo.secureworks.net:20001]
>>
>> (org.apache.drill.common.exceptions.DrillRuntimeException) Error in
>> parquet record reader.
>> Message:
>> Hadoop path: /path/to/files/-m-00001.snappy.parquet
>> Total records read: 393120
>> Mock records read: 0
>> Records to read: 32768
>> Row group index: 0
>> Records in row group: 536499
>> Parquet Metadata: ParquetMetaData{FileMetaData{schema: message events {
>> …
>>
>> org.apache.drill.exec.store.parquet.columnreaders.ParquetRecordReader.handleAndRaise():352
>> org.apache.drill.exec.store.parquet.columnreaders.ParquetRecordReader.next():454
>> org.apache.drill.exec.physical.impl.ScanBatch.next():191
>> org.apache.drill.exec.record.AbstractRecordBatch.next():119
>> org.apache.drill.exec.record.AbstractRecordBatch.next():109
>> org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
>> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():129
>> org.apache.drill.exec.record.AbstractRecordBatch.next():162
>> org.apache.drill.exec.record.AbstractRecordBatch.next():119
>> org.apache.drill.exec.record.AbstractRecordBatch.next():109
>> org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
>> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():129
>> org.apache.drill.exec.record.AbstractRecordBatch.next():162
>> org.apache.drill.exec.record.AbstractRecordBatch.next():119
>> org.apache.drill.exec.record.AbstractRecordBatch.next():109
>> org.apache.drill.exec.physical.impl.WriterRecordBatch.innerNext():91
>> org.apache.drill.exec.record.AbstractRecordBatch.next():162
>> org.apache.drill.exec.physical.impl.BaseRootExec.next():104
>> org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext():92
>> org.apache.drill.exec.physical.impl.BaseRootExec.next():94
>> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():257
>> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():251
>> java.security.AccessController.doPrivileged():-2
>> javax.security.auth.Subject.doAs():422
>> org.apache.hadoop.security.UserGroupInformation.doAs():1595
>> org.apache.drill.exec.work.fragment.FragmentExecutor.run():251
>> org.apache.drill.common.SelfCleaningRunnable.run():38
>> java.util.concurrent.ThreadPoolExecutor.runWorker():1142
>> java.util.concurrent.ThreadPoolExecutor$Worker.run():617
>> java.lang.Thread.run():745
>>
>> Caused By (java.lang.ArrayIndexOutOfBoundsException) 107014
>> org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary.decodeToLong():164
>> org.apache.parquet.column.values.dictionary.DictionaryValuesReader.readLong():122
>> org.apache.drill.exec.store.parquet.columnreaders.ParquetFixedWidthDictionaryReaders$DictionaryBigIntReader.readField():161
>> org.apache.drill.exec.store.parquet.columnreaders.ColumnReader.readValues():120
>> org.apache.drill.exec.store.parquet.columnreaders.ColumnReader.processPageData():169
>> org.apache.drill.exec.store.parquet.columnreaders.ColumnReader.determineSize():146
>> org.apache.drill.exec.store.parquet.columnreaders.ColumnReader.processPages():107
>> org.apache.drill.exec.store.parquet.columnreaders.ParquetRecordReader.readAllFixedFields():393
>> org.apache.drill.exec.store.parquet.columnreaders.ParquetRecordReader.next():439
>> org.apache.drill.exec.physical.impl.ScanBatch.next():191
>> org.apache.drill.exec.record.AbstractRecordBatch.next():119
>> org.apache.drill.exec.record.AbstractRecordBatch.next():109
>> org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
>> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():129
>> org.apache.drill.exec.record.AbstractRecordBatch.next():162
>> org.apache.drill.exec.record.AbstractRecordBatch.next():119
>> org.apache.drill.exec.record.AbstractRecordBatch.next():109
>> org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
>> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():129
>> org.apache.drill.exec.record.AbstractRecordBatch.next():162
>> org.apache.drill.exec.record.AbstractRecordBatch.next():119
>> org.apache.drill.exec.record.AbstractRecordBatch.next():109
>> org.apache.drill.exec.physical.impl.WriterRecordBatch.innerNext():91
>> org.apache.drill.exec.record.AbstractRecordBatch.next():162
>> org.apache.drill.exec.physical.impl.BaseRootExec.next():104
>> org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext():92
>> org.apache.drill.exec.physical.impl.BaseRootExec.next():94
>> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():257
>> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():251
>> java.security.AccessController.doPrivileged():-2
>> javax.security.auth.Subject.doAs():422
>> org.apache.hadoop.security.UserGroupInformation.doAs():1595
>> org.apache.drill.exec.work.fragment.FragmentExecutor.run():251
>> org.apache.drill.common.SelfCleaningRunnable.run():38
>> java.util.concurrent.ThreadPoolExecutor.runWorker():1142
>> java.util.concurrent.ThreadPoolExecutor$Worker.run():617
>> java.lang.Thread.run():745 (state=,code=0)
>>
>
>
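P.S. To put the narrowing-down steps in one place for anyone who finds this thread later, below is roughly what I'm running now, again via the REST API. The drillbit address, paths, and column names are placeholders, and bad_field stands in for the real column:

import requests

DRILL_URL = "http://localhost:8047/query.json"  # placeholder drillbit address
BAD_DAY = "dfs.`/path/to/knownbad`"             # placeholder: the full known-bad day of data
BAD_FILE = "dfs.`/path/to/knownbad/badfile.snappy.parquet`"  # placeholder: the single ~240 MB file

def run_sql(sql):
    # Submit a query over the Drill REST API and return the JSON response.
    return requests.post(DRILL_URL, json={"queryType": "SQL", "query": sql}).json()

# Fails with the ArrayIndexOutOfBoundsException against the whole day...
print(run_sql("SELECT MIN(`bad_field`), MAX(`bad_field`) FROM %s" % BAD_DAY))

# ...and also fails against just the one "bad" file, which Impala reads fine.
print(run_sql("SELECT MIN(`bad_field`), MAX(`bad_field`) FROM %s" % BAD_FILE))

# Leaving bad_field out of the select list lets the CTAS over the same data succeed.
print(run_sql("CREATE TABLE dfs.tmp.`ctas_no_badfield` AS "
              "SELECT `field_001`, `field_002` FROM %s" % BAD_DAY))  # placeholder columns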
