Besides finding the root cause of this problem, I think I can at least work
around the WARNING message by overriding the Parquet property:
parquet.enable.summary-metadata
which, according to the PARQUET-107
<https://issues.apache.org/jira/browse/PARQUET-107> ticket, can be used to
disable writing the summary file that is the issue here.
How can I set this property? I tried:
sql.setConf("parquet.enable.summary-metadata", "false")
sql.sql("SET parquet.enable.summary-metadata=false")
as well as: spark-submit --conf parquet.enable.summary-metadata=false
but none of these helped. Can anyone help? Of course the original problem
remains open.
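
Since this is a Parquet/Hadoop property rather than a Spark SQL one, maybe
it has to go into the Hadoop configuration instead? A sketch of what I mean
(not verified on my cluster yet; sc is the SparkContext):

// set the Parquet property on the underlying Hadoop configuration
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

or, equivalently at submit time, via the spark.hadoop.* prefix that Spark
copies into the Hadoop configuration:

spark-submit --conf spark.hadoop.parquet.enable.summary-metadata=false ...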
Thanks!
Krzysiek
2015-08-09 14:19 GMT+02:00 Krzysztof Zarzycki <[email protected]>:
> Hi there,
> I have a problem with a Spark Streaming job running on Spark 1.4.1 that
> appends to a Parquet table.
>
> My job receives JSON strings and creates a JSON RDD out of them. The JSONs
> may come in different shapes, as most of the fields are optional, but they
> never have conflicting schemas.
> Next, for each (non-empty) RDD I'm saving it to Parquet files, appending
> to the existing table:
>
> jsonRdd.write.mode(SaveMode.Append).parquet(dataDirPath)
>
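> For context, roughly how the whole thing is wired up (a simplified sketch;
> the stream and variable names are illustrative, not my exact code):
>
> import org.apache.spark.sql.{SQLContext, SaveMode}
>
> // jsonStream: DStream[String] of raw JSON documents
> jsonStream.foreachRDD { rdd =>
>   if (!rdd.isEmpty()) {
>     val sql = new SQLContext(rdd.sparkContext)
>     val jsonRdd = sql.read.json(rdd) // infers a schema per micro-batch
>     jsonRdd.write.mode(SaveMode.Append).parquet(dataDirPath)
>   }
> }
>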
> Unfortunately, I'm now hitting an issue, a conflict, on every append:
>
> Aug 9, 2015 7:58:03 AM WARNING: parquet.hadoop.ParquetOutputCommitter:
> could not write summary file for hdfs://example.com:8020/tmp/parquet
> java.lang.RuntimeException: could not merge metadata: key
> org.apache.spark.sql.parquet.row.metadata has conflicting values:
> [{...schema1...}, {...schema2...} ]
>
> The schemas are very similar; some attributes may be missing compared to
> the others, but they are definitely not conflicting. They are pretty
> lengthy, but I compared them with diff and ensured that there are no
> conflicts.
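>
> (For the comparison I dump each batch's inferred schema in a text form,
> roughly like this, and diff the dumps:
>
> println(jsonRdd.schema.treeString)
> )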
>
> Even with this WARNING the write actually succeeds, and I'm able to read
> the data back. But on every batch yet another schema shows up in the
> displayed "conflicting values" array. I would like the job to run
> indefinitely, so I can't simply ignore this warning: the ever-growing
> metadata will probably end in an OOM.
>
> Do you know what might be the reason for this error/warning? How can I
> overcome it? Maybe it is a Spark bug/regression? I saw tickets like
> SPARK-6010 <https://issues.apache.org/jira/browse/SPARK-6010>, but they
> seem to have been fixed in 1.3.0 (I'm using 1.4.1).
>
>
> Thanks for any help!
> Krzysiek
>