I have a largish directory of parquet files generated for use in Impala.
They were created with the CDH version of apache-parquet-mr (I'm not sure
of the exact version at this time).
Some settings:
Compression: snappy
Use Dictionary: true
WRITER_VERSION: PARQUET_1_0
I can read them as-is in Drill, but all the string columns come through as
binary (see other thread). I can CAST those fields to VARCHAR and read
them that way, but I take a bad performance hit: 2 seconds to read
directly from the raw parquet with LIMIT 10 (showing binary), versus 25
seconds through a view that CASTs all fields to the proper types. The data
comes back accurately, but 10 rows taking 25 seconds is too long.
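For context, the casting view is essentially of this shape (the view name, paths, and column names below are placeholders, not my real schema — the real view casts every binary-read column):

```sql
-- Sketch of the casting view; names are hypothetical.
CREATE OR REPLACE VIEW dfs.tmp.events_view AS
SELECT
  CAST(event_id   AS BIGINT)       AS event_id,
  CAST(event_name AS VARCHAR(255)) AS event_name,
  CAST(host       AS VARCHAR(255)) AS host
FROM dfs.`/path/to/files`;
```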
So I want to read from this directory (approx. 126 GB) and CTAS into a
form Drill will be happier with.
I've tried this two ways. One was to CTAS directly from the view I
created, with everything else at defaults. The other was to set
"new_reader" = true. Neither worked, and new_reader actually behaves very
badly (I need to restart the drillbits). At least the default reader
errors out :)
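The two attempts were essentially the following (table names and paths are placeholders; the real statement selects from the casting view above my real data):

```sql
-- Attempt 1: default reader. Attempt 2: same CTAS with the
-- session option flipped to true first.
ALTER SESSION SET `store.parquet.use_new_reader` = false;
CREATE TABLE dfs.tmp.events_converted AS
SELECT * FROM dfs.tmp.events_view;
```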
store.parquet.use_new_reader = false (the default)
This threw the error below (it's truncated; there were lots of field
names and other things). It stored 6 GB of files and then died.
store.parquet.use_new_reader = true
1.4 GB of files were created and then everything hangs; I need to restart
the drillbits (is this a known issue?)
Error from "non" new_reader:
Error: SYSTEM ERROR: ArrayIndexOutOfBoundsException: 107014
Fragment 1:36
[Error Id: ab5b202f-94cc-4275-b136-537dfbea6b31 on
atl1ctuzeta05.ctu-bo.secureworks.net:20001]
(org.apache.drill.common.exceptions.DrillRuntimeException) Error in
parquet record reader.
Message:
Hadoop path: /path/to/files/-m-00001.snappy.parquet
Total records read: 393120
Mock records read: 0
Records to read: 32768
Row group index: 0
Records in row group: 536499
Parquet Metadata: ParquetMetaData{FileMetaData{schema: message events {
…
org.apache.drill.exec.store.parquet.columnreaders.ParquetRecordReader.handleAndRaise():352
org.apache.drill.exec.store.parquet.columnreaders.ParquetRecordReader.next():454
org.apache.drill.exec.physical.impl.ScanBatch.next():191
org.apache.drill.exec.record.AbstractRecordBatch.next():119
org.apache.drill.exec.record.AbstractRecordBatch.next():109
org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():129
org.apache.drill.exec.record.AbstractRecordBatch.next():162
org.apache.drill.exec.record.AbstractRecordBatch.next():119
org.apache.drill.exec.record.AbstractRecordBatch.next():109
org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():129
org.apache.drill.exec.record.AbstractRecordBatch.next():162
org.apache.drill.exec.record.AbstractRecordBatch.next():119
org.apache.drill.exec.record.AbstractRecordBatch.next():109
org.apache.drill.exec.physical.impl.WriterRecordBatch.innerNext():91
org.apache.drill.exec.record.AbstractRecordBatch.next():162
org.apache.drill.exec.physical.impl.BaseRootExec.next():104
org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext():92
org.apache.drill.exec.physical.impl.BaseRootExec.next():94
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():257
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():251
java.security.AccessController.doPrivileged():-2
javax.security.auth.Subject.doAs():422
org.apache.hadoop.security.UserGroupInformation.doAs():1595
org.apache.drill.exec.work.fragment.FragmentExecutor.run():251
org.apache.drill.common.SelfCleaningRunnable.run():38
java.util.concurrent.ThreadPoolExecutor.runWorker():1142
java.util.concurrent.ThreadPoolExecutor$Worker.run():617
java.lang.Thread.run():745
Caused By (java.lang.ArrayIndexOutOfBoundsException) 107014
org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary.decodeToLong():164
org.apache.parquet.column.values.dictionary.DictionaryValuesReader.readLong():122
org.apache.drill.exec.store.parquet.columnreaders.ParquetFixedWidthDictionaryReaders$DictionaryBigIntReader.readField():161
org.apache.drill.exec.store.parquet.columnreaders.ColumnReader.readValues():120
org.apache.drill.exec.store.parquet.columnreaders.ColumnReader.processPageData():169
org.apache.drill.exec.store.parquet.columnreaders.ColumnReader.determineSize():146
org.apache.drill.exec.store.parquet.columnreaders.ColumnReader.processPages():107
org.apache.drill.exec.store.parquet.columnreaders.ParquetRecordReader.readAllFixedFields():393
org.apache.drill.exec.store.parquet.columnreaders.ParquetRecordReader.next():439
org.apache.drill.exec.physical.impl.ScanBatch.next():191
org.apache.drill.exec.record.AbstractRecordBatch.next():119
org.apache.drill.exec.record.AbstractRecordBatch.next():109
org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():129
org.apache.drill.exec.record.AbstractRecordBatch.next():162
org.apache.drill.exec.record.AbstractRecordBatch.next():119
org.apache.drill.exec.record.AbstractRecordBatch.next():109
org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():129
org.apache.drill.exec.record.AbstractRecordBatch.next():162
org.apache.drill.exec.record.AbstractRecordBatch.next():119
org.apache.drill.exec.record.AbstractRecordBatch.next():109
org.apache.drill.exec.physical.impl.WriterRecordBatch.innerNext():91
org.apache.drill.exec.record.AbstractRecordBatch.next():162
org.apache.drill.exec.physical.impl.BaseRootExec.next():104
org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext():92
org.apache.drill.exec.physical.impl.BaseRootExec.next():94
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():257
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():251
java.security.AccessController.doPrivileged():-2
javax.security.auth.Subject.doAs():422
org.apache.hadoop.security.UserGroupInformation.doAs():1595
org.apache.drill.exec.work.fragment.FragmentExecutor.run():251
org.apache.drill.common.SelfCleaningRunnable.run():38
java.util.concurrent.ThreadPoolExecutor.runWorker():1142
java.util.concurrent.ThreadPoolExecutor$Worker.run():617
java.lang.Thread.run():745 (state=,code=0)