I have a largish directory of parquet files generated for use in Impala.
They were created with the CDH version of apache-parquet-mr (not sure of
the version at this time).

Some settings:
Compression: snappy
Use Dictionary: true
WRITER_VERSION: PARQUET_1_0

I can read them as-is in Drill; however, the strings all come through as
binary (see other thread). I can CAST all those fields to VARCHAR and read
them, but I take a bad performance hit: 2 seconds to read directly from the
raw parquet with LIMIT 10 (but showing binary), versus 25 seconds through a
view that CASTs all fields to the proper types. The data returns accurately,
but 25 seconds for 10 rows is too long.
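For reference, the view is roughly of this shape (a sketch only: the table path and column names here are made up, and the real view casts every binary column):

```sql
-- Hypothetical sketch of the casting view; names/paths are placeholders.
CREATE OR REPLACE VIEW dfs.tmp.events_view AS
SELECT
  CAST(event_id   AS BIGINT)       AS event_id,
  CAST(event_name AS VARCHAR(255)) AS event_name,
  CAST(event_ts   AS VARCHAR(30))  AS event_ts
FROM dfs.`/path/to/files`;
```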

So I want to read from this directory (approx. 126 GB) and CTAS into a form
Drill will be happier with.
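That is, something along these lines (a sketch; the target path and view name are placeholders):

```sql
-- Hypothetical sketch: rewrite the data via CTAS from the casting view.
CREATE TABLE dfs.tmp.`events_rewritten` AS
SELECT * FROM dfs.tmp.events_view;
```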

I've tried this two ways. One was to CTAS directly from the view I created,
with everything else left at defaults. The other was to also set the reader
option "use_new_reader" = true. Neither worked, and the new reader actually
behaves very badly (the drillbits need to be restarted). At least the default
reader errors :)

store.parquet.use_new_reader = false (the default)
This threw the error below (it's a truncated error; lots of field names
and other things omitted). It stored 6 GB of files and died.

store.parquet.use_new_reader = true
1.4 GB of files were created and then everything hangs; the drillbits need
to be restarted (is this a known issue?)
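The option was toggled per session, e.g. (standard Drill syntax; the option name is as above):

```sql
ALTER SESSION SET `store.parquet.use_new_reader` = true;  -- hangs after ~1.4 GB written
ALTER SESSION SET `store.parquet.use_new_reader` = false; -- default; throws the error below
```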



Error from the "non" new_reader:

Error: SYSTEM ERROR: ArrayIndexOutOfBoundsException: 107014

Fragment 1:36

[Error Id: ab5b202f-94cc-4275-b136-537dfbea6b31 on atl1ctuzeta05.ctu-bo.secureworks.net:20001]

  (org.apache.drill.common.exceptions.DrillRuntimeException) Error in parquet record reader.
  Message:
  Hadoop path: /path/to/files/-m-00001.snappy.parquet
  Total records read: 393120
  Mock records read: 0
  Records to read: 32768
  Row group index: 0
  Records in row group: 536499
  Parquet Metadata: ParquetMetaData{FileMetaData{schema: message events {
  …

    org.apache.drill.exec.store.parquet.columnreaders.ParquetRecordReader.handleAndRaise():352
    org.apache.drill.exec.store.parquet.columnreaders.ParquetRecordReader.next():454
    org.apache.drill.exec.physical.impl.ScanBatch.next():191
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():109
    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
    org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():129
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():109
    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
    org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():129
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():109
    org.apache.drill.exec.physical.impl.WriterRecordBatch.innerNext():91
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.physical.impl.BaseRootExec.next():104
    org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext():92
    org.apache.drill.exec.physical.impl.BaseRootExec.next():94
    org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():257
    org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():251
    java.security.AccessController.doPrivileged():-2
    javax.security.auth.Subject.doAs():422
    org.apache.hadoop.security.UserGroupInformation.doAs():1595
    org.apache.drill.exec.work.fragment.FragmentExecutor.run():251
    org.apache.drill.common.SelfCleaningRunnable.run():38
    java.util.concurrent.ThreadPoolExecutor.runWorker():1142
    java.util.concurrent.ThreadPoolExecutor$Worker.run():617
    java.lang.Thread.run():745

  Caused By (java.lang.ArrayIndexOutOfBoundsException) 107014
    org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary.decodeToLong():164
    org.apache.parquet.column.values.dictionary.DictionaryValuesReader.readLong():122
    org.apache.drill.exec.store.parquet.columnreaders.ParquetFixedWidthDictionaryReaders$DictionaryBigIntReader.readField():161
    org.apache.drill.exec.store.parquet.columnreaders.ColumnReader.readValues():120
    org.apache.drill.exec.store.parquet.columnreaders.ColumnReader.processPageData():169
    org.apache.drill.exec.store.parquet.columnreaders.ColumnReader.determineSize():146
    org.apache.drill.exec.store.parquet.columnreaders.ColumnReader.processPages():107
    org.apache.drill.exec.store.parquet.columnreaders.ParquetRecordReader.readAllFixedFields():393
    org.apache.drill.exec.store.parquet.columnreaders.ParquetRecordReader.next():439
    org.apache.drill.exec.physical.impl.ScanBatch.next():191
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():109
    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
    org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():129
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():109
    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
    org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():129
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():109
    org.apache.drill.exec.physical.impl.WriterRecordBatch.innerNext():91
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.physical.impl.BaseRootExec.next():104
    org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext():92
    org.apache.drill.exec.physical.impl.BaseRootExec.next():94
    org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():257
    org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():251
    java.security.AccessController.doPrivileged():-2
    javax.security.auth.Subject.doAs():422
    org.apache.hadoop.security.UserGroupInformation.doAs():1595
    org.apache.drill.exec.work.fragment.FragmentExecutor.run():251
    org.apache.drill.common.SelfCleaningRunnable.run():38
    java.util.concurrent.ThreadPoolExecutor.runWorker():1142
    java.util.concurrent.ThreadPoolExecutor$Worker.run():617
    java.lang.Thread.run():745 (state=,code=0)
