Yeah, thanks Zahid for the reply; but that's not it.

I found two schemas that differ. So I have the sucker(s) now...but how to
handle them?

In this case the same column appears with two types: Double in one schema and
Decimal(19,5) in the other, which in Parquet seems to be represented as
FIXED_LEN_BYTE_ARRAY

price:                         OPTIONAL FIXED_LEN_BYTE_ARRAY L:DECIMAL(19,5) R:0 D:1
    vs
price:                         OPTIONAL DOUBLE R:0 D:1

(1) First thought is to cast the types after the load:
   message_1 = message\
        .withColumn('price', col('price').cast("double"))\
        .withColumn('price_eur', col('price_eur').cast("double"))

This works if that file is the only parquet being read from the prefix, i.e.
if its schema matched all the other files in that same prefix. But it
doesn't: the other parquets in there have the "correct" schema, so the read
borks before I ever get to cast anything. So I have to somehow separate these
things out first, maybe along the lines of the sketch below.
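
Here's the rough, untested sketch I have in mind for doing that separation
myself on the driver. It leans on Spark's internal JVM gateway to list the
files (not a public API), the prefix is made up, and it assumes 'price' and
'price_eur' are the only affected columns:

   # list the parquet files, split them by the footer type of 'price',
   # then read each group with a consistent schema and union them
   hadoop_conf = spark._jsc.hadoopConfiguration()      # internal API
   Path = spark._jvm.org.apache.hadoop.fs.Path
   prefix = "hdfs:///user/hadoop/feb20/"               # made-up prefix
   fs = Path(prefix).getFileSystem(hadoop_conf)
   paths = [str(s.getPath()) for s in fs.listStatus(Path(prefix))
            if s.getPath().getName().endswith(".parquet")]

   from pyspark.sql.functions import col
   from pyspark.sql.types import DecimalType

   decimal_files, double_files = [], []
   for p in paths:
       # .schema only needs the footer, so no full scan here
       fields = {f.name: f.dataType for f in spark.read.parquet(p).schema.fields}
       if isinstance(fields.get('price'), DecimalType):
           decimal_files.append(p)
       else:
           double_files.append(p)

   # assumes both lists are non-empty; unionByName needs Spark 2.3+
   fixed = spark.read.parquet(*decimal_files)\
       .withColumn('price', col('price').cast('double'))\
       .withColumn('price_eur', col('price_eur').cast('double'))
   message = spark.read.parquet(*double_files).unionByName(fixed)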

(2) Next, perhaps I can treat hitting one of these FIXED_LEN_BYTE_ARRAY
parquets as an exception and deal with it independently (perhaps copy the
file elsewhere and have a separate process handle it). That would be ok if I
could figure out how to not die when I hit this error but catch it instead;
I can't seem to figure out how to do that. The best I've come up with is the
per-file try/except sketch below.
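
Since the error only shows up at action time, the closest I've got is forcing
a small read of each file inside a try/except. Another untested sketch:
bad_prefix is a made-up quarantine location, and paths / myMessageSchema come
from the earlier sketch and my original mail:

   import subprocess

   bad_prefix = 'hdfs:///user/hadoop/feb20_quarantine/'   # made-up quarantine dir
   good_files = []
   for p in paths:
       try:
           # force Spark to decode the suspect columns, not just read the footer
           spark.read.schema(myMessageSchema).parquet(p)\
                .select('price', 'price_eur').distinct().count()
           good_files.append(p)
       except Exception:
           # the UnsupportedOperationException surfaces here wrapped in a
           # Py4JJavaError; shell out to move the file aside for a separate process
           subprocess.run(['hdfs', 'dfs', '-mv', p, bad_prefix], check=True)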

More thoughts and suggestions are very welcome.

Thanks folks.

On Mon, Mar 9, 2020 at 11:42 AM Zahid Rahman <zahidr1...@gmail.com> wrote:

>
> This issue has been discussed and resolved on this page:
>
> https://issues.apache.org/jira/browse/SPARK-17557
>
>
> It is suggested by one person that simply reading the parquet file in a
> different way, as illustrated, may make the error go away. It appears to me
> you are reading the parquet file from the command line. Perhaps if you try
> it programmatically as suggested you may find a resolution.
>
> *"** I encounter an issue when data resides in Hive as parquet format and
> when trying to read from Spark (2.2.1), facing the above issue. I notice
> that in my case there is date field (contains values as 2018, 2017) which
> is written as integer. But when reading in spark as -*
>
> *val df = spark.sql("SELECT * FROM db.table") *
>
>
>
> *df.show(3, false) java.lang.UnsupportedOperationException:
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
> at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:48)*
>
>
>
>
> *To my surprise when reading same data from s3 location as - val df =
> spark.read.parquet("s3://path/file") df.show(3, false) // this displays the
> results. "*
>
> On Mon, 9 Mar 2020 at 07:57, Hamish Whittal <ham...@cloud-fundis.co.za>
> wrote:
>
>> Hi folks,
>>
>> Thanks for the help thus far.
>>
>> I'm trying to track down the source of this error:
>>
>> java.lang.UnsupportedOperationException:
>> org.apache.parquet.column.values.dictionary.PlainValuesDictionary
>>
>> when doing a message.show()
>>
>> Basically I'm reading in a single Parquet file (to try to narrow things
>> down).
>>
>> I'm defining the schema in the beginning and loading the parquet with:
>>    message = spark\
>>              .read\
>>              .schema(myMessageSchema)\
>>              .format("parquet")\
>>              .option("mergeSchema", "true")\
>>              .option("badRecordsPath", "/tmp/badRecords/")\
>>              .load("hdfs:///user/hadoop/feb20/part-00000-c6da95c9-9c40-4623-a5c5-851188e236ff-c000.snappy.parquet")
>>
>> [I've tried with and without the mergeSchema option]
>> [sidenote: I was hoping badRecordsPath would help with the truly bad
>> records, but it seems to do nothing]
>>
>> I've also tried to cast the potential problematic columns (so Int, Long,
>> Double, etc) with
>>
>>   message_1 = message\
>>     .withColumn('price', col('price').cast('double'))\
>>     .withColumn('price_eur', col('price_eur').cast('double'))\
>>     .withColumn('cost_usd', col('cost_usd').cast('double'))\
>>     .withColumn('adapter_status', col('adapter_status').cast('long'))
>>
>> Yet I get this error and I can't figure out:
>> (a) whether it's some record WITHIN the parquet file that's causing it and
>> (b) if it is a single record (or a few records) then how do I find those
>> particular records?
>>
>> The previous time I encountered this, there were records that should have
>> had doubles in them (like "price" above) but actually seemed to contain
>> nulls.
>>
>> I did this to fix that particular problem:
>>
>> if 'price' not in message.columns:
>>     message = message.withColumn('price', lit('0'))  # lit from pyspark.sql.functions
>>
>> Any suggestions or help would be MOST welcome. I have also tried using
>> pyarrow to take a look at the Parquet schema and it looks fine. I mean, it
>> doesn't look like the schema in the parquet is the problem - but of course
>> I'm not ruling that out just yet.
>>
>> Thanks for any suggestions,
>>
>> Hamish
>> --
>> Cloud-Fundis.co.za
>> Cape Town, South Africa
>> +27 79 614 4913
>>
>

-- 
Cloud-Fundis.co.za
Cape Town, South Africa
+27 79 614 4913
