The MapR support folks gave me a good idea to troubleshoot: can you home in on which columns are the problem? I have nearly 100 fields in this table, and the hunch was that only a few of them might be at issue. I took this idea and wrote a Python script that took the field list and, using the REST API, ran a CTAS over a known bad day of data. When it failed, I recorded that, along with the file that was failing. (For some reason I couldn't get a CTAS against specific files to fail, only when the files were queried all together.) On each iteration I took the last field off the list and tried the CTAS again (rough sketch of the script below). Eventually I found the field: a BIGINT field that we will call bad_field. Now, what if I did a select min(bad_field), max(bad_field) from `path/to/knownbad`? Boom, that fails as well, with the same array index out of bounds error. Cool. What if I did the CTAS without that field? Boom, that worked. (We need a JIRA filed to get me to stop saying boom.)
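In case it helps anyone else, here is a rough, simplified sketch of that elimination script. The drillbit address, field names, path, and temp table name are all placeholders rather than the real ones, and exactly how a failed query is reported can vary by Drill version, so treat the error check as approximate:

import requests

DRILL_URL = "http://localhost:8047/query.json"    # placeholder drillbit address
FIELDS = ["field_001", "field_002", "bad_field"]  # placeholder: the ~100 field names, in order
BAD_DAY = "dfs.`/path/to/knownbad`"               # placeholder: the known bad day of data

def run_sql(sql):
    # Submit a query over the Drill REST API and return (ok, json_body).
    resp = requests.post(DRILL_URL, json={"queryType": "SQL", "query": sql})
    body = resp.json()
    # Depending on the Drill version, a failed query shows up as a non-200
    # status and/or an "errorMessage" key in the response body.
    return resp.status_code == 200 and "errorMessage" not in body, body

remaining = list(FIELDS)
last_dropped = None
while remaining:
    cols = ", ".join("`%s`" % f for f in remaining)
    ok, body = run_sql("CREATE TABLE dfs.tmp.`ctas_probe` AS SELECT %s FROM %s"
                       % (cols, BAD_DAY))
    run_sql("DROP TABLE dfs.tmp.`ctas_probe`")  # clean up any (partial) output; result ignored
    if ok:
        print("CTAS succeeded with %d columns; suspect column: %s"
              % (len(remaining), last_dropped))
        break
    # The error text includes the Hadoop path of the failing file, so the real
    # script also recorded body.get("errorMessage") to track which file broke.
    print("CTAS failed with %d columns" % len(remaining))
    last_dropped = remaining.pop()  # take the last field off and try again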
Ok, I think I am on to something here. Next step: could I make the min/max
query fail when querying ONLY that file? Yes! Ok, we are getting close. This
is great, because now instead of 120 GB of data, I can look at 240 MB of
data. However, the same min/max in Impala works fine, and I am unsure what
to look at next. I will be doing a WebEx with MapR Support tomorrow, but I
thought I'd multi-thread this too, mainly because if someone else is having
a similar problem, I want to keep what I am doing to solve it out in the
open. (A recap of the exact queries is at the bottom of this mail, below the
quoted messages.)

So: we have a known "bad" Parquet file (quotes because, remember, Impala
queries this file just fine), created in MapReduce, with a BIGINT column
causing Array Index Out of Bounds problems in Drill. What would your next
steps be to troubleshoot?

On Mon, May 23, 2016 at 4:16 PM, John Omernik <[email protected]> wrote:

> Troubleshooting this is made more difficult by the fact that the file that
> gives the error works fine when I select directly from it into a new
> table... this makes it very tricky to troubleshoot; any assistance on this
> would be appreciated. I've opened a ticket with MapR as well, but I am
> stumped, and this is our primary use case right now, thus this is a
> blocker. (Note I've tried three different days: two fail, one works.)
>
> John
>
> On Mon, May 23, 2016 at 9:48 AM, John Omernik <[email protected]> wrote:
>
>> I have a largish directory of Parquet files generated for use in Impala.
>> They were created with the CDH version of apache-parquet-mr (not sure of
>> the version at this time).
>>
>> Some settings:
>> Compression: snappy
>> Use Dictionary: true
>> WRITER_VERSION: PARQUET_1_0
>>
>> I can read them as is in Drill; however, the strings all come through as
>> binary (see other thread). I can CAST all those fields as VARCHAR and
>> read them, but I take a bad performance hit (2 seconds to read directly
>> from raw parquet, limit 10, but showing binary; 25 seconds to use a view
>> that CASTs all fields into the proper types... data returns accurately,
>> but 10 rows taking 25 seconds is too long).
>>
>> So I want to read from this directory (approx 126 GB) and CTAS in a way
>> Drill will be happier with.
>>
>> I've tried this two ways. One was just to CTAS directly from the view I
>> created, all else being default. The other was to set the reader
>> "new_reader" = true. Neither worked, and new_reader actually behaves very
>> badly (need to restart drillbits). At least the other, default reader
>> errors :)
>>
>> store.parquet.use_new_reader = false (the default)
>> This threw the error below (it's a truncated error, with lots of field
>> names and other things). It stored 6 GB of files and died.
>>
>> store.parquet.use_new_reader = true
>> 1.4 GB of files created and everything hangs, need to restart drillbits
>> (is this an issue?)
>>
>> Error from the "non" new_reader:
>>
>> Error: SYSTEM ERROR: ArrayIndexOutOfBoundsException: 107014
>>
>> Fragment 1:36
>>
>> [Error Id: ab5b202f-94cc-4275-b136-537dfbea6b31 on
>> atl1ctuzeta05.ctu-bo.secureworks.net:20001]
>>
>> (org.apache.drill.common.exceptions.DrillRuntimeException) Error in
>> parquet record reader.
>> Message:
>> Hadoop path: /path/to/files/-m-00001.snappy.parquet
>> Total records read: 393120
>> Mock records read: 0
>> Records to read: 32768
>> Row group index: 0
>> Records in row group: 536499
>> Parquet Metadata: ParquetMetaData{FileMetaData{schema: message events {
>> …
>>
>> org.apache.drill.exec.store.parquet.columnreaders.ParquetRecordReader.handleAndRaise():352
>> org.apache.drill.exec.store.parquet.columnreaders.ParquetRecordReader.next():454
>> org.apache.drill.exec.physical.impl.ScanBatch.next():191
>> org.apache.drill.exec.record.AbstractRecordBatch.next():119
>> org.apache.drill.exec.record.AbstractRecordBatch.next():109
>> org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
>> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():129
>> org.apache.drill.exec.record.AbstractRecordBatch.next():162
>> org.apache.drill.exec.record.AbstractRecordBatch.next():119
>> org.apache.drill.exec.record.AbstractRecordBatch.next():109
>> org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
>> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():129
>> org.apache.drill.exec.record.AbstractRecordBatch.next():162
>> org.apache.drill.exec.record.AbstractRecordBatch.next():119
>> org.apache.drill.exec.record.AbstractRecordBatch.next():109
>> org.apache.drill.exec.physical.impl.WriterRecordBatch.innerNext():91
>> org.apache.drill.exec.record.AbstractRecordBatch.next():162
>> org.apache.drill.exec.physical.impl.BaseRootExec.next():104
>> org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext():92
>> org.apache.drill.exec.physical.impl.BaseRootExec.next():94
>> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():257
>> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():251
>> java.security.AccessController.doPrivileged():-2
>> javax.security.auth.Subject.doAs():422
>> org.apache.hadoop.security.UserGroupInformation.doAs():1595
>> org.apache.drill.exec.work.fragment.FragmentExecutor.run():251
>> org.apache.drill.common.SelfCleaningRunnable.run():38
>> java.util.concurrent.ThreadPoolExecutor.runWorker():1142
>> java.util.concurrent.ThreadPoolExecutor$Worker.run():617
>> java.lang.Thread.run():745
>>
>> Caused By (java.lang.ArrayIndexOutOfBoundsException) 107014
>> org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary.decodeToLong():164
>> org.apache.parquet.column.values.dictionary.DictionaryValuesReader.readLong():122
>> org.apache.drill.exec.store.parquet.columnreaders.ParquetFixedWidthDictionaryReaders$DictionaryBigIntReader.readField():161
>> org.apache.drill.exec.store.parquet.columnreaders.ColumnReader.readValues():120
>> org.apache.drill.exec.store.parquet.columnreaders.ColumnReader.processPageData():169
>> org.apache.drill.exec.store.parquet.columnreaders.ColumnReader.determineSize():146
>> org.apache.drill.exec.store.parquet.columnreaders.ColumnReader.processPages():107
>> org.apache.drill.exec.store.parquet.columnreaders.ParquetRecordReader.readAllFixedFields():393
>> org.apache.drill.exec.store.parquet.columnreaders.ParquetRecordReader.next():439
>> org.apache.drill.exec.physical.impl.ScanBatch.next():191
>> org.apache.drill.exec.record.AbstractRecordBatch.next():119
>> org.apache.drill.exec.record.AbstractRecordBatch.next():109
>> org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
>> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():129
>> org.apache.drill.exec.record.AbstractRecordBatch.next():162
>> org.apache.drill.exec.record.AbstractRecordBatch.next():119
>> org.apache.drill.exec.record.AbstractRecordBatch.next():109
>> org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
>> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():129
>> org.apache.drill.exec.record.AbstractRecordBatch.next():162
>> org.apache.drill.exec.record.AbstractRecordBatch.next():119
>> org.apache.drill.exec.record.AbstractRecordBatch.next():109
>> org.apache.drill.exec.physical.impl.WriterRecordBatch.innerNext():91
>> org.apache.drill.exec.record.AbstractRecordBatch.next():162
>> org.apache.drill.exec.physical.impl.BaseRootExec.next():104
>> org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext():92
>> org.apache.drill.exec.physical.impl.BaseRootExec.next():94
>> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():257
>> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():251
>> java.security.AccessController.doPrivileged():-2
>> javax.security.auth.Subject.doAs():422
>> org.apache.hadoop.security.UserGroupInformation.doAs():1595
>> org.apache.drill.exec.work.fragment.FragmentExecutor.run():251
>> org.apache.drill.common.SelfCleaningRunnable.run():38
>> java.util.concurrent.ThreadPoolExecutor.runWorker():1142
>> java.util.concurrent.ThreadPoolExecutor$Worker.run():617
>> java.lang.Thread.run():745 (state=,code=0)
>>
>
>
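P.S. To put the narrowing-down steps in one place for anyone who finds this thread later, below is roughly what I'm running now, again via the REST API. The drillbit address, paths, and column names are placeholders, and bad_field stands in for the real column:

import requests

DRILL_URL = "http://localhost:8047/query.json"  # placeholder drillbit address
BAD_DAY = "dfs.`/path/to/knownbad`"             # placeholder: the full known-bad day of data
BAD_FILE = "dfs.`/path/to/knownbad/badfile.snappy.parquet`"  # placeholder: the single ~240 MB file

def run_sql(sql):
    # Submit a query over the Drill REST API and return the JSON response.
    return requests.post(DRILL_URL, json={"queryType": "SQL", "query": sql}).json()

# Fails with the ArrayIndexOutOfBoundsException against the whole day...
print(run_sql("SELECT MIN(`bad_field`), MAX(`bad_field`) FROM %s" % BAD_DAY))

# ...and also fails against just the one "bad" file, which Impala reads fine.
print(run_sql("SELECT MIN(`bad_field`), MAX(`bad_field`) FROM %s" % BAD_FILE))

# Leaving bad_field out of the select list lets the CTAS over the same data succeed.
print(run_sql("CREATE TABLE dfs.tmp.`ctas_no_badfield` AS "
              "SELECT `field_001`, `field_002` FROM %s" % BAD_DAY))  # placeholder columns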
