I see, this makes sense. We should probably add this in Spark SQL.
However, there's one corner case to note about user-defined Parquet
metadata. When committing a write job, ParquetOutputCommitter writes
Parquet summary files (_metadata and _common_metadata), and user-defined
key-value metadata written in all Parquet part-files get merged here.
The problem is that, if a single key is associated with multiple values
across part-files, Parquet doesn't know how to reconcile them, and simply
gives up writing summary files. This can be particularly annoying when
appending.
In general, users should avoid storing "unstable" values like timestamps
as Parquet metadata.
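To make the corner case concrete, here is a simplified Python model of the
merge. It is only a sketch of the behaviour described above, not Parquet's
actual code: the function names, the keys "schema.tag" and "written.at", and
the sample values are all made up for illustration.

```python
from collections import defaultdict


def merge_key_value_metadata(footers):
    """Collect every distinct value seen for each key across part-file footers."""
    merged = defaultdict(set)
    for footer in footers:
        for key, value in footer.items():
            merged[key].add(value)
    return dict(merged)


def can_write_summary(merged):
    """In this model, a summary file needs exactly one value per key."""
    return all(len(values) == 1 for values in merged.values())


# Hypothetical footers: a stable tag is identical in every part-file ...
stable = [{"schema.tag": "events_v1"}, {"schema.tag": "events_v1"}]
# ... while a per-job timestamp differs between the first write and an append.
unstable = [{"written.at": "2015-09-21T18:17"}, {"written.at": "2015-09-22T01:58"}]

stable_ok = can_write_summary(merge_key_value_metadata(stable))
unstable_ok = can_write_summary(merge_key_value_metadata(unstable))
print(stable_ok, unstable_ok)
```

In this model the stable tag merges down to a single value, while the
timestamp ends up with two values for one key, which is the situation where
the summary files would be skipped.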
Cheng
On 9/22/15 1:58 AM, Borisa Zivkovic wrote:
Thanks for the answer.
I need this in order to be able to track schema metadata.
Basically, when I create Parquet files from Spark I want to be able to
"tag" them in some way (giving the schema an appropriate name or
attaching some key/values), so that it is fairly easy to get basic
metadata about the Parquet files when processing and discovering them
later on.
On Mon, 21 Sep 2015 at 18:17 Cheng Lian <lian.cs....@gmail.com> wrote:
Currently Spark SQL doesn't support customizing schema name and
metadata. May I know why these two matter in your use case? Some
Parquet data models, like parquet-avro, do support it, while some
others don't (e.g. parquet-hive).
Cheng
On 9/21/15 7:13 AM, Borisa Zivkovic wrote:
> Hi,
>
> I am trying to figure out how to write parquet metadata when
> persisting DataFrames to parquet using Spark (1.4.1)
>
> I could not find a way to change schema name (which seems to be
> hardcoded to root) and also how to add data to key/value metadata in
> parquet footer.
>
> org.apache.parquet.hadoop.metadata.FileMetaData#getKeyValueMetaData
>
> org.apache.parquet.schema.Type#getName
>
> thanks
>
>