Troubleshooting this is made more difficult by the fact that the file that gives the error works fine when I select directly from it into a new table. That makes this very tricky to troubleshoot, and any assistance would be appreciated. I've opened a ticket with MapR as well, but I am stumped, and this is our primary use case right now, so it's a blocker. (Note: I've tried three different days; two fail, one works.)
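For reference, the single-file test that works is essentially the following. The workspace and table names are placeholders (like the path); the file is the one named in the error below.

  -- Direct CTAS from the single file, no view and no casts: this succeeds.
  CREATE TABLE dfs.tmp.`single_file_test` AS
  SELECT * FROM dfs.root.`/path/to/files/-m-00001.snappy.parquet`;

A sketch of the view/CTAS statements that fail over the whole directory is at the bottom of this message, below the quoted error.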
John

On Mon, May 23, 2016 at 9:48 AM, John Omernik <[email protected]> wrote:

> I have a largish directory of parquet files generated for use in Impala.
> They were created with the CDH version of apache-parquet-mr (not sure of the version at this time).
>
> Some settings:
> Compression: snappy
> Use Dictionary: true
> WRITER_VERSION: PARQUET_1_0
>
> I can read them as is in Drill; however, the strings all come through as binary (see other thread).
> I can CAST all those fields to VARCHAR and read them, but I take a bad performance hit:
> 2 seconds to read directly from the raw parquet with LIMIT 10 (but showing binary),
> versus 25 seconds to use a view that CASTs all fields into the proper types.
> The data returns accurately, but 10 rows taking 25 seconds is too long.
>
> So I want to read from this directory (approx. 126 GB) and CTAS into something Drill will be happier with.
>
> I've tried this two ways. One was just to CTAS directly from the view I created, all else being default.
> The other was to also set the reader option "new_reader" = true.
> Neither worked, and new_reader actually behaves very badly (I need to restart the drillbits).
> At least the other, default reader errors :)
>
> store.parquet.use_new_reader = false (the default):
> This threw the error below (it's a truncated error; there are lots of field names and other things).
> It stored 6 GB of files and died.
>
> store.parquet.use_new_reader = true:
> 1.4 GB of files were created and everything hangs; I need to restart the drillbits (is this an issue?).
>
> Error from the default (non new_reader) reader:
>
> Error: SYSTEM ERROR: ArrayIndexOutOfBoundsException: 107014
>
> Fragment 1:36
>
> [Error Id: ab5b202f-94cc-4275-b136-537dfbea6b31 on atl1ctuzeta05.ctu-bo.secureworks.net:20001]
>
> (org.apache.drill.common.exceptions.DrillRuntimeException) Error in parquet record reader.
> Message:
> Hadoop path: /path/to/files/-m-00001.snappy.parquet
> Total records read: 393120
> Mock records read: 0
> Records to read: 32768
> Row group index: 0
> Records in row group: 536499
> Parquet Metadata: ParquetMetaData{FileMetaData{schema: message events {
> …
>
> org.apache.drill.exec.store.parquet.columnreaders.ParquetRecordReader.handleAndRaise():352
> org.apache.drill.exec.store.parquet.columnreaders.ParquetRecordReader.next():454
> org.apache.drill.exec.physical.impl.ScanBatch.next():191
> org.apache.drill.exec.record.AbstractRecordBatch.next():119
> org.apache.drill.exec.record.AbstractRecordBatch.next():109
> org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():129
> org.apache.drill.exec.record.AbstractRecordBatch.next():162
> org.apache.drill.exec.record.AbstractRecordBatch.next():119
> org.apache.drill.exec.record.AbstractRecordBatch.next():109
> org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():129
> org.apache.drill.exec.record.AbstractRecordBatch.next():162
> org.apache.drill.exec.record.AbstractRecordBatch.next():119
> org.apache.drill.exec.record.AbstractRecordBatch.next():109
> org.apache.drill.exec.physical.impl.WriterRecordBatch.innerNext():91
> org.apache.drill.exec.record.AbstractRecordBatch.next():162
> org.apache.drill.exec.physical.impl.BaseRootExec.next():104
> org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext():92
> org.apache.drill.exec.physical.impl.BaseRootExec.next():94
> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():257
> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():251
> java.security.AccessController.doPrivileged():-2
> javax.security.auth.Subject.doAs():422
> org.apache.hadoop.security.UserGroupInformation.doAs():1595
> org.apache.drill.exec.work.fragment.FragmentExecutor.run():251
> org.apache.drill.common.SelfCleaningRunnable.run():38
> java.util.concurrent.ThreadPoolExecutor.runWorker():1142
> java.util.concurrent.ThreadPoolExecutor$Worker.run():617
> java.lang.Thread.run():745
>
> Caused By (java.lang.ArrayIndexOutOfBoundsException) 107014
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary.decodeToLong():164
> org.apache.parquet.column.values.dictionary.DictionaryValuesReader.readLong():122
> org.apache.drill.exec.store.parquet.columnreaders.ParquetFixedWidthDictionaryReaders$DictionaryBigIntReader.readField():161
> org.apache.drill.exec.store.parquet.columnreaders.ColumnReader.readValues():120
> org.apache.drill.exec.store.parquet.columnreaders.ColumnReader.processPageData():169
> org.apache.drill.exec.store.parquet.columnreaders.ColumnReader.determineSize():146
> org.apache.drill.exec.store.parquet.columnreaders.ColumnReader.processPages():107
> org.apache.drill.exec.store.parquet.columnreaders.ParquetRecordReader.readAllFixedFields():393
> org.apache.drill.exec.store.parquet.columnreaders.ParquetRecordReader.next():439
> org.apache.drill.exec.physical.impl.ScanBatch.next():191
> org.apache.drill.exec.record.AbstractRecordBatch.next():119
> org.apache.drill.exec.record.AbstractRecordBatch.next():109
> org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():129
> org.apache.drill.exec.record.AbstractRecordBatch.next():162
> org.apache.drill.exec.record.AbstractRecordBatch.next():119
> org.apache.drill.exec.record.AbstractRecordBatch.next():109
> org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():129
> org.apache.drill.exec.record.AbstractRecordBatch.next():162
> org.apache.drill.exec.record.AbstractRecordBatch.next():119
> org.apache.drill.exec.record.AbstractRecordBatch.next():109
> org.apache.drill.exec.physical.impl.WriterRecordBatch.innerNext():91
> org.apache.drill.exec.record.AbstractRecordBatch.next():162
> org.apache.drill.exec.physical.impl.BaseRootExec.next():104
> org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext():92
> org.apache.drill.exec.physical.impl.BaseRootExec.next():94
> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():257
> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():251
> java.security.AccessController.doPrivileged():-2
> javax.security.auth.Subject.doAs():422
> org.apache.hadoop.security.UserGroupInformation.doAs():1595
> org.apache.drill.exec.work.fragment.FragmentExecutor.run():251
> org.apache.drill.common.SelfCleaningRunnable.run():38
> java.util.concurrent.ThreadPoolExecutor.runWorker():1142
> java.util.concurrent.ThreadPoolExecutor$Worker.run():617
> java.lang.Thread.run():745 (state=,code=0)
>
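For completeness, the view and CTAS described above look roughly like this. The view, column, and workspace names are placeholders (the real view casts every binary field); the session option is the one discussed above.

  -- Reader option, shown at its default; setting it to true is the variant that hangs.
  ALTER SESSION SET `store.parquet.use_new_reader` = false;

  -- View that CASTs the binary columns to VARCHAR (placeholder column names).
  CREATE OR REPLACE VIEW dfs.tmp.`events_vw` AS
  SELECT
    CAST(col_a AS VARCHAR(256)) AS col_a,
    CAST(col_b AS VARCHAR(256)) AS col_b,
    other_col
  FROM dfs.root.`/path/to/files`;

  -- CTAS over the whole ~126 GB directory via the view; this is what dies partway through.
  CREATE TABLE dfs.tmp.`events_ctas` AS
  SELECT * FROM dfs.tmp.`events_vw`;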
