The conflicting metadata values warning is a known issue: https://issues.apache.org/jira/browse/PARQUET-194

The option "parquet.enable.summary-metadata" is a Hadoop option rather than a Spark option, so you need to either add it to your Hadoop configuration file(s) or add it via `sparkContext.hadoopConfiguration` before starting your job.
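A minimal sketch of the second approach, assuming a plain SparkContext setup (the app name is illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("parquet-append"))

// Set the option on the *Hadoop* configuration before any Parquet writes;
// setting it via SQLContext.setConf or "SET ..." has no effect because it
// is not a Spark SQL property.
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
```

Alternatively, Spark copies any `spark.hadoop.*` property into the Hadoop configuration, so passing `--conf spark.hadoop.parquet.enable.summary-metadata=false` to spark-submit should have the same effect.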

Cheng

On 8/9/15 8:57 PM, Krzysztof Zarzycki wrote:
Besides finding the cause of this problem, I think I can at least work around the WARNING message by overriding the Parquet setting parquet.enable.summary-metadata, which according to the PARQUET-107 <https://issues.apache.org/jira/browse/PARQUET-107> ticket can be used to disable writing the summary file that is the issue here.
How can I set this variable? I tried
sql.setConf("parquet.enable.summary-metadata", "false")
sql.sql("SET parquet.enable.summary-metadata=false")
As well as: spark-submit --conf parquet.enable.summary-metadata=false

But none of these helped. Can anyone help? Of course the original problem stays open.
Thanks!
Krzysiek

2015-08-09 14:19 GMT+02:00 Krzysztof Zarzycki <[email protected] <mailto:[email protected]>>:

    Hi there,
    I have a problem with a Spark Streaming job running on Spark
    1.4.1 that appends to a Parquet table.

    My job receives JSON strings and creates a JSON RDD out of them.
    The JSONs may come in different shapes, as most of the fields are
    optional, but they never have conflicting schemas.
    Next, I save each (non-empty) RDD to Parquet files, appending to
    the existing table:

    jsonRdd.write.mode(SaveMode.Append).parquet(dataDirPath)

    Unfortunately, I'm now hitting a conflict issue on every append:

    Aug 9, 2015 7:58:03 AM WARNING:
    parquet.hadoop.ParquetOutputCommitter: could not write summary
    file for hdfs://example.com:8020/tmp/parquet
    <http://example.com:8020/tmp/parquet>
    java.lang.RuntimeException: could not merge metadata: key
    org.apache.spark.sql.parquet.row.metadata has conflicting values:
    [{...schema1...}, {...schema2...} ]

    The schemas are very similar; some attributes may be missing
    compared to others, but they are definitely not conflicting. They
    are pretty lengthy, but I compared them with diff and ensured
    that there are no conflicts.

    Even with this WARNING, the write actually succeeds and I'm able
    to read the data. But with every batch, yet another schema is
    added to the displayed "conflicting values" array. I would like
    the job to run indefinitely, so I can't simply ignore this
    warning, because it will probably end in an OOM.

    Do you know what might be the reason for this error/warning? How
    can I overcome it? Maybe it is a Spark bug/regression? I saw
    tickets like SPARK-6010
    <https://issues.apache.org/jira/browse/SPARK-6010>, but those
    seem to have been fixed in 1.3.0 (I'm using 1.4.1).


    Thanks for any help!
    Krzysiek



