Joris Van den Bossche created ARROW-16339:
---------------------------------------------
             Summary: [C++][Parquet] Parquet FileMetaData key_value_metadata not always mapped to Arrow Schema metadata
                 Key: ARROW-16339
                 URL: https://issues.apache.org/jira/browse/ARROW-16339
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++, Parquet, Python
            Reporter: Joris Van den Bossche

Context: I ran into this issue when reading Parquet files created by GDAL (using the Arrow C++ APIs, [https://github.com/OSGeo/gdal/pull/5477]), which writes files that have custom key_value_metadata, but without storing "ARROW:schema" in those metadata (cc [~paleolimbot]).

Both in reading and writing files, I expected that we would map Arrow {{Schema::metadata}} to Parquet {{FileMetaData::key_value_metadata}}. But apparently this doesn't (always) happen out of the box, and only happens through the "ARROW:schema" field (which stores the original Arrow schema, and thus the metadata stored in that schema).

For example, when writing a Table with schema metadata, the metadata is not stored directly in the Parquet FileMetaData (the code below uses the branch from ARROW-16337 to have the {{store_schema}} keyword):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({'a': [1, 2, 3]}, metadata={"key": "value"})
pq.write_table(table, "test_metadata_with_arrow_schema.parquet")
pq.write_table(table, "test_metadata_without_arrow_schema.parquet", store_schema=False)

# original schema has metadata
>>> table.schema
a: int64
-- schema metadata --
key: 'value'

# reading back only has the metadata in case we stored ARROW:schema
>>> pq.read_table("test_metadata_with_arrow_schema.parquet").schema
a: int64
-- schema metadata --
key: 'value'

# and not if ARROW:schema is absent
>>> pq.read_table("test_metadata_without_arrow_schema.parquet").schema
a: int64
{code}

It seems that if we store the ARROW:schema, we _also_ store the schema metadata separately.
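To make the "ARROW:schema" mechanism concrete: the value stored under that key is a base64-encoded, IPC-serialized copy of the original schema, so it can be decoded to recover the schema (and its metadata) that the reader restores. A minimal sketch (the filename is illustrative):

{code:python}
import base64

import pyarrow as pa
import pyarrow.parquet as pq

# write a table whose schema carries custom metadata
table = pa.table({'a': [1, 2, 3]}, metadata={"key": "value"})
pq.write_table(table, "demo_arrow_schema.parquet")

# the ARROW:schema value is a base64-encoded IPC-serialized schema;
# decoding it recovers the original schema, including its metadata
kv = pq.read_metadata("demo_arrow_schema.parquet").metadata
embedded = pa.ipc.read_schema(
    pa.BufferReader(base64.b64decode(kv[b"ARROW:schema"]))
)
>>> embedded.metadata
{b'key': b'value'}
{code}

This is why the metadata "survives" the round trip only via "ARROW:schema": it travels inside the serialized schema, not as separate key/value entries.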
But if {{store_schema}} is False, we also stop writing those metadata (I'm not fully sure whether this is the intended behaviour, and whether that's the reason for the above output):

{code:python}
# when storing the ARROW:schema, we ALSO store key:value metadata
>>> pq.read_metadata("test_metadata_with_arrow_schema.parquet").metadata
{b'ARROW:schema': b'/////7AAAAAQAAAAAAAKAA4ABgAFAA...', b'key': b'value'}

# when not storing the schema, we also don't store the key:value
>>> pq.read_metadata("test_metadata_without_arrow_schema.parquet").metadata is None
True
{code}

On the reading side, it seems that we generally do read custom key/value metadata into the schema metadata. We don't have the pyarrow APIs at the moment to create such a file (given the above), but with a small patch I could create one:

{code:python}
# a Parquet file with ParquetFileMetaData::metadata that ONLY has a custom key
>>> pq.read_metadata("test_metadata_without_arrow_schema2.parquet").metadata
{b'key': b'value'}

# this metadata is now correctly mapped to the Arrow schema metadata
>>> pq.read_schema("test_metadata_without_arrow_schema2.parquet")
a: int64
-- schema metadata --
key: 'value'
{code}

But if a file has both custom key/value metadata and an "ARROW:schema" key, we actually ignore the custom keys and only look at the "ARROW:schema" one. This is the case I ran into with GDAL, where I have a file with both keys, but where the custom "geo" key is not also included in the serialized Arrow schema stored under the "ARROW:schema" key:

{code:python}
# includes both keys in the Parquet file
>>> pq.read_metadata("test_gdal.parquet").metadata
{b'geo': b'{"version":"0.1.0","...', b'ARROW:schema': b'/////3gBAAAQ...'}

# the "geo" key is lost in the Arrow schema
>>> pq.read_table("test_gdal.parquet").schema.metadata is None
True
{code}

--
This message was sent by Atlassian Jira
(v8.20.7#820007)