Troubleshooting this is made more difficult by the fact that the file that gives the error works fine when I select directly from it into a new table. That makes this very tricky to troubleshoot, and any assistance would be appreciated. I've opened a ticket with MapR as well, but I am stumped, and this is our primary use case right now, so it's a blocker. (Note: I've tried three different days; two fail, one works.)
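For reference, the single-file test that works is essentially the following. The workspace and table names are placeholders (like the path); the file is the one named in the error below.

  -- Direct CTAS from the single file, no view and no casts: this succeeds.
  CREATE TABLE dfs.tmp.`single_file_test` AS
  SELECT * FROM dfs.root.`/path/to/files/-m-00001.snappy.parquet`;

A sketch of the view/CTAS statements that fail over the whole directory is at the bottom of this message, below the quoted error.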
John

On Mon, May 23, 2016 at 9:48 AM, John Omernik <[email protected]> wrote:

> I have a largish directory of parquet files generated for use in Impala.
> They were created with the CDH version of apache-parquet-mr (not sure of the version at this time).
>
> Some settings:
> Compression: snappy
> Use Dictionary: true
> WRITER_VERSION: PARQUET_1_0
>
> I can read them as is in Drill; however, the strings all come through as binary (see other thread).
> I can CAST all those fields to VARCHAR and read them, but I take a bad performance hit:
> 2 seconds to read directly from the raw parquet with LIMIT 10 (but showing binary),
> versus 25 seconds to use a view that CASTs all fields into the proper types.
> The data returns accurately, but 10 rows taking 25 seconds is too long.
>
> So I want to read from this directory (approx. 126 GB) and CTAS into something Drill will be happier with.
>
> I've tried this two ways. One was just to CTAS directly from the view I created, all else being default.
> The other was to also set the reader option "new_reader" = true.
> Neither worked, and new_reader actually behaves very badly (I need to restart the drillbits).
> At least the other, default reader errors :)
>
> store.parquet.use_new_reader = false (the default):
> This threw the error below (it's a truncated error; there are lots of field names and other things).
> It stored 6 GB of files and died.
>
> store.parquet.use_new_reader = true:
> 1.4 GB of files were created and everything hangs; I need to restart the drillbits (is this an issue?).
>
> Error from the default (non new_reader) reader:
>
> Error: SYSTEM ERROR: ArrayIndexOutOfBoundsException: 107014
>
> Fragment 1:36
>
> [Error Id: ab5b202f-94cc-4275-b136-537dfbea6b31 on atl1ctuzeta05.ctu-bo.secureworks.net:20001]
>
> (org.apache.drill.common.exceptions.DrillRuntimeException) Error in parquet record reader.
> Message:
> Hadoop path: /path/to/files/-m-00001.snappy.parquet
> Total records read: 393120
> Mock records read: 0
> Records to read: 32768
> Row group index: 0
> Records in row group: 536499
> Parquet Metadata: ParquetMetaData{FileMetaData{schema: message events {
> …
>
> org.apache.drill.exec.store.parquet.columnreaders.ParquetRecordReader.handleAndRaise():352
> org.apache.drill.exec.store.parquet.columnreaders.ParquetRecordReader.next():454
> org.apache.drill.exec.physical.impl.ScanBatch.next():191
> org.apache.drill.exec.record.AbstractRecordBatch.next():119
> org.apache.drill.exec.record.AbstractRecordBatch.next():109
> org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():129
> org.apache.drill.exec.record.AbstractRecordBatch.next():162
> org.apache.drill.exec.record.AbstractRecordBatch.next():119
> org.apache.drill.exec.record.AbstractRecordBatch.next():109
> org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():129
> org.apache.drill.exec.record.AbstractRecordBatch.next():162
> org.apache.drill.exec.record.AbstractRecordBatch.next():119
> org.apache.drill.exec.record.AbstractRecordBatch.next():109
> org.apache.drill.exec.physical.impl.WriterRecordBatch.innerNext():91
> org.apache.drill.exec.record.AbstractRecordBatch.next():162
> org.apache.drill.exec.physical.impl.BaseRootExec.next():104
> org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext():92
> org.apache.drill.exec.physical.impl.BaseRootExec.next():94
> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():257
> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():251
> java.security.AccessController.doPrivileged():-2
> javax.security.auth.Subject.doAs():422
> org.apache.hadoop.security.UserGroupInformation.doAs():1595
> org.apache.drill.exec.work.fragment.FragmentExecutor.run():251
> org.apache.drill.common.SelfCleaningRunnable.run():38
> java.util.concurrent.ThreadPoolExecutor.runWorker():1142
> java.util.concurrent.ThreadPoolExecutor$Worker.run():617
> java.lang.Thread.run():745
>
> Caused By (java.lang.ArrayIndexOutOfBoundsException) 107014
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary.decodeToLong():164
> org.apache.parquet.column.values.dictionary.DictionaryValuesReader.readLong():122
> org.apache.drill.exec.store.parquet.columnreaders.ParquetFixedWidthDictionaryReaders$DictionaryBigIntReader.readField():161
> org.apache.drill.exec.store.parquet.columnreaders.ColumnReader.readValues():120
> org.apache.drill.exec.store.parquet.columnreaders.ColumnReader.processPageData():169
> org.apache.drill.exec.store.parquet.columnreaders.ColumnReader.determineSize():146
> org.apache.drill.exec.store.parquet.columnreaders.ColumnReader.processPages():107
> org.apache.drill.exec.store.parquet.columnreaders.ParquetRecordReader.readAllFixedFields():393
> org.apache.drill.exec.store.parquet.columnreaders.ParquetRecordReader.next():439
> org.apache.drill.exec.physical.impl.ScanBatch.next():191
> org.apache.drill.exec.record.AbstractRecordBatch.next():119
> org.apache.drill.exec.record.AbstractRecordBatch.next():109
> org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():129
> org.apache.drill.exec.record.AbstractRecordBatch.next():162
> org.apache.drill.exec.record.AbstractRecordBatch.next():119
> org.apache.drill.exec.record.AbstractRecordBatch.next():109
> org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():129
> org.apache.drill.exec.record.AbstractRecordBatch.next():162
> org.apache.drill.exec.record.AbstractRecordBatch.next():119
> org.apache.drill.exec.record.AbstractRecordBatch.next():109
> org.apache.drill.exec.physical.impl.WriterRecordBatch.innerNext():91
> org.apache.drill.exec.record.AbstractRecordBatch.next():162
> org.apache.drill.exec.physical.impl.BaseRootExec.next():104
> org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext():92
> org.apache.drill.exec.physical.impl.BaseRootExec.next():94
> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():257
> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():251
> java.security.AccessController.doPrivileged():-2
> javax.security.auth.Subject.doAs():422
> org.apache.hadoop.security.UserGroupInformation.doAs():1595
> org.apache.drill.exec.work.fragment.FragmentExecutor.run():251
> org.apache.drill.common.SelfCleaningRunnable.run():38
> java.util.concurrent.ThreadPoolExecutor.runWorker():1142
> java.util.concurrent.ThreadPoolExecutor$Worker.run():617
> java.lang.Thread.run():745 (state=,code=0)
>
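For completeness, the view and CTAS described above look roughly like this. The view, column, and workspace names are placeholders (the real view casts every binary field); the session option is the one discussed above.

  -- Reader option, shown at its default; setting it to true is the variant that hangs.
  ALTER SESSION SET `store.parquet.use_new_reader` = false;

  -- View that CASTs the binary columns to VARCHAR (placeholder column names).
  CREATE OR REPLACE VIEW dfs.tmp.`events_vw` AS
  SELECT
    CAST(col_a AS VARCHAR(256)) AS col_a,
    CAST(col_b AS VARCHAR(256)) AS col_b,
    other_col
  FROM dfs.root.`/path/to/files`;

  -- CTAS over the whole ~126 GB directory via the view; this is what dies partway through.
  CREATE TABLE dfs.tmp.`events_ctas` AS
  SELECT * FROM dfs.tmp.`events_vw`;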
