The conflicting metadata values warning is a known issue:
https://issues.apache.org/jira/browse/PARQUET-194
The option "parquet.enable.summary-metadata" is a Hadoop option rather
than a Spark option, so you need to either add it to your Hadoop
configuration file(s) or add it via `sparkContext.hadoopConfiguration`
before starting your job.
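For example, a minimal sketch, assuming `sc` is your SparkContext (the
setting must be in place before the first Parquet write):

    // Disable Parquet summary (_metadata) files for subsequent writes
    sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

If I remember correctly, `spark-submit --conf
spark.hadoop.parquet.enable.summary-metadata=false` should also work,
since Spark copies `spark.hadoop.*` properties into the Hadoop
configuration it builds for the job.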
Cheng
On 8/9/15 8:57 PM, Krzysztof Zarzycki wrote:
Besides looking for a fix for the underlying problem, I think I can at
least work around the WARNING message by overriding the Parquet
variable:
parquet.enable.summary-metadata
According to the PARQUET-107
<https://issues.apache.org/jira/browse/PARQUET-107> ticket, it can be
used to disable writing the summary file, which is the issue here.
How can I set this variable? I tried:
sql.setConf("parquet.enable.summary-metadata", "false")
sql.sql("SET parquet.enable.summary-metadata=false")
as well as: spark-submit --conf parquet.enable.summary-metadata=false
But none of them helped. Can anyone help? Of course, the original
problem stays open.
Thanks!
Krzysiek
2015-08-09 14:19 GMT+02:00 Krzysztof Zarzycki <[email protected]>:
Hi there,
I have a problem with a Spark Streaming job, running on Spark
1.4.1, that appends to a Parquet table.
My job receives JSON strings and creates a JsonRdd out of them. The
JSONs may come in different shapes, as most of the fields are
optional, but they never have conflicting schemas.
Next, I save each (non-empty) RDD to Parquet files, appending to
the existing table:
jsonRdd.write.mode(SaveMode.Append).parquet(dataDirPath)
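In full, each micro-batch is handled roughly like this (a simplified
sketch; `jsonStream` and `sqlContext` stand in for my actual setup):

    // jsonStream is a DStream[String] of raw JSON records;
    // sqlContext is an SQLContext built on the same SparkContext.
    jsonStream.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {                     // skip empty micro-batches
        val jsonRdd = sqlContext.read.json(rdd) // schema inferred per batch
        jsonRdd.write.mode(SaveMode.Append).parquet(dataDirPath)
      }
    }

Since the schema is inferred independently for each micro-batch,
batches that miss some optional fields get a slightly different
inferred schema, which I suspect is what ends up in the summary
metadata.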
Unfortunately, I'm now hitting a conflict issue on every append:
Aug 9, 2015 7:58:03 AM WARNING:
parquet.hadoop.ParquetOutputCommitter: could not write summary
file for hdfs://example.com:8020/tmp/parquet
java.lang.RuntimeException: could not merge metadata: key
org.apache.spark.sql.parquet.row.metadata has conflicting values:
[{...schema1...}, {...schema2...} ]
The schemas are very similar; some attributes may be missing in one
compared to another, but they are definitely not conflicting. They
are pretty lengthy, but I compared them with diff and ensured that
there are no conflicts.
Even with this WARNING the write actually succeeds, and I'm able to
read the data back. But on every batch, yet another schema is added
to the displayed "conflicting values" array. I would like the job
to run forever, so I can't simply ignore this warning, because the
ever-growing list will probably end in an OOM.
Do you know what might be the reason for this error/warning? How
can I overcome it? Maybe it is a Spark bug/regression? I saw
tickets like SPARK-6010
<https://issues.apache.org/jira/browse/SPARK-6010>, but they seem
to have been fixed in 1.3.0 (I'm using 1.4.1).
Thanks for any help!
Krzysiek