The conflicting metadata values warning is a known issue:
https://issues.apache.org/jira/browse/PARQUET-194
The option "parquet.enable.summary-metadata" is a Hadoop option rather
than a Spark option, so you need to either add it to your Hadoop
configuration file(s) or add it via `sparkContext.hadoopConfiguration`
before starting your job.
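For example, a minimal sketch, assuming `sc` is your SparkContext (the
setting must be in place before the first Parquet write):

    // Disable Parquet summary (_metadata) files for subsequent writes
    sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

If I remember correctly, `spark-submit --conf
spark.hadoop.parquet.enable.summary-metadata=false` should also work,
since Spark copies `spark.hadoop.*` properties into the Hadoop
configuration it builds for the job.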
Cheng
On 8/9/15 8:57 PM, Krzysztof Zarzycki wrote:
Besides looking for a fix for the underlying problem, I think I can at
least work around the WARNING message by overriding the Parquet
variable:
parquet.enable.summary-metadata
According to the PARQUET-107
<https://issues.apache.org/jira/browse/PARQUET-107> ticket, it can be
used to disable writing the summary file, which is the issue here.
How can I set this variable? I tried:
sql.setConf("parquet.enable.summary-metadata", "false")
sql.sql("SET parquet.enable.summary-metadata=false")
as well as: spark-submit --conf parquet.enable.summary-metadata=false
But none of them helped. Can anyone help? Of course, the original
problem stays open.
Thanks!
Krzysiek
2015-08-09 14:19 GMT+02:00 Krzysztof Zarzycki <[email protected]>:
Hi there,
I have a problem with a Spark Streaming job, running on Spark
1.4.1, that appends to a Parquet table.
My job receives JSON strings and creates a JsonRdd out of them. The
JSONs may come in different shapes, as most of the fields are
optional, but they never have conflicting schemas.
Next, I save each (non-empty) RDD to Parquet files, appending to
the existing table:
jsonRdd.write.mode(SaveMode.Append).parquet(dataDirPath)
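In full, each micro-batch is handled roughly like this (a simplified
sketch; `jsonStream` and `sqlContext` stand in for my actual setup):

    // jsonStream is a DStream[String] of raw JSON records;
    // sqlContext is an SQLContext built on the same SparkContext.
    jsonStream.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {                     // skip empty micro-batches
        val jsonRdd = sqlContext.read.json(rdd) // schema inferred per batch
        jsonRdd.write.mode(SaveMode.Append).parquet(dataDirPath)
      }
    }

Since the schema is inferred independently for each micro-batch,
batches that miss some optional fields get a slightly different
inferred schema, which I suspect is what ends up in the summary
metadata.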
Unfortunately, I'm now hitting a conflict issue on every append:
Aug 9, 2015 7:58:03 AM WARNING:
parquet.hadoop.ParquetOutputCommitter: could not write summary
file for hdfs://example.com:8020/tmp/parquet
java.lang.RuntimeException: could not merge metadata: key
org.apache.spark.sql.parquet.row.metadata has conflicting values:
[{...schema1...}, {...schema2...} ]
The schemas are very similar; some attributes may be missing in one
compared to another, but they are definitely not conflicting. They
are pretty lengthy, but I compared them with diff and ensured that
there are no conflicts.
Even with this WARNING the write actually succeeds, and I'm able to
read the data back. But on every batch, yet another schema is added
to the displayed "conflicting values" array. I would like the job
to run forever, so I can't simply ignore this warning, because the
ever-growing list will probably end in an OOM.
Do you know what might be the reason for this error/warning? How
can I overcome it? Maybe it is a Spark bug/regression? I saw
tickets like SPARK-6010
<https://issues.apache.org/jira/browse/SPARK-6010>, but they seem
to have been fixed in 1.3.0 (I'm using 1.4.1).
Thanks for any help!
Krzysiek