Joris Van den Bossche created ARROW-16339:
---------------------------------------------

             Summary: [C++][Parquet] Parquet FileMetaData key_value_metadata 
not always mapped to Arrow Schema metadata
                 Key: ARROW-16339
                 URL: https://issues.apache.org/jira/browse/ARROW-16339
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++, Parquet, Python
            Reporter: Joris Van den Bossche


Context: I ran into this issue when reading Parquet files created by GDAL 
(using the Arrow C++ APIs, [https://github.com/OSGeo/gdal/pull/5477]), which 
writes files that have custom key_value_metadata, but without storing 
ARROW:schema in those metadata (cc [~paleolimbot]).

—

Both when reading and when writing files, I expected that we would map Arrow 
{{Schema::metadata}} to Parquet {{FileMetaData::key_value_metadata}}. But 
apparently this doesn't (always) happen out of the box, and in some cases it 
only happens through the "ARROW:schema" key (which stores the original Arrow 
schema, and thus the metadata stored in this schema).

For example, when writing a Table with schema metadata, this metadata is not 
always stored directly in the Parquet FileMetaData (the code below uses the 
branch from ARROW-16337 to have the {{store_schema}} keyword):
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({'a': [1, 2, 3]}, metadata={"key": "value"})
pq.write_table(table, "test_metadata_with_arrow_schema.parquet")
pq.write_table(table, "test_metadata_without_arrow_schema.parquet",
               store_schema=False)

# original schema has metadata
>>> table.schema
a: int64
-- schema metadata --
key: 'value'

# reading back only has the metadata in case we stored ARROW:schema
>>> pq.read_table("test_metadata_with_arrow_schema.parquet").schema
a: int64
-- schema metadata --
key: 'value'
# and not if ARROW:schema is absent
>>> pq.read_table("test_metadata_without_arrow_schema.parquet").schema
a: int64
{code}
It seems that if we store the ARROW:schema, we _also_ store the schema metadata 
as separate key/value entries. But if {{store_schema}} is False, we stop 
writing those metadata as well, which explains the above output (I am not 
fully sure if this is the intended behaviour):
{code:python}
# when storing the ARROW:schema, we ALSO store key:value metadata
>>> pq.read_metadata("test_metadata_with_arrow_schema.parquet").metadata
{b'ARROW:schema': b'/////7AAAAAQAAAAAAAKAA4ABgAFAA...',
 b'key': b'value'}
# when not storing the schema, we also don't store the key:value
>>> pq.read_metadata("test_metadata_without_arrow_schema.parquet").metadata is None
True
{code}
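
A minimal pure-Python sketch of the writing-side mapping described above as the expected behaviour. This is a model only, not the Arrow C++ implementation: the function name {{expected_file_key_value_metadata}} and its arguments are hypothetical, and it assumes that custom schema metadata should be written regardless of whether the serialized schema is stored.

```python
from typing import Dict, Optional

ARROW_SCHEMA_KEY = b"ARROW:schema"


def expected_file_key_value_metadata(
    schema_metadata: Optional[Dict[bytes, bytes]],
    serialized_schema: Optional[bytes],
) -> Optional[Dict[bytes, bytes]]:
    """Model the Parquet key/value metadata a writer could emit.

    `serialized_schema` is None when the schema is not stored
    (the store_schema=False case). Hypothetical helper, not a pyarrow API.
    """
    # Always carry over the custom schema metadata (assumed expectation).
    out = dict(schema_metadata or {})
    # Add the serialized Arrow schema only when it is requested.
    if serialized_schema is not None:
        out[ARROW_SCHEMA_KEY] = serialized_schema
    # Mirror pyarrow's convention of None for "no metadata at all".
    return out or None
```

Under this model, the {{store_schema=False}} file above would still contain {{b'key': b'value'}} instead of having no metadata.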
On the reading side, it seems that we generally do read custom key/value 
metadata into the schema metadata. pyarrow currently has no API to create a 
file that has only custom keys (given the above), but with a small patch I 
could create one:
{code:python}
# a Parquet file with ParquetFileMetaData::metadata that ONLY has a custom key
>>> pq.read_metadata("test_metadata_without_arrow_schema2.parquet").metadata
{b'key': b'value'}

# this metadata is now correctly mapped to the Arrow schema metadata
>>> pq.read_schema("test_metadata_without_arrow_schema2.parquet")
a: int64
-- schema metadata --
key: 'value'
{code}
But if you have a file that has both custom key/value metadata and an 
"ARROW:schema" key, we actually ignore the custom keys, and only look at the 
"ARROW:schema" one.
This was the case that I ran into with GDAL, where I have a file with both 
keys, but where the custom "geo" key is not included in the Arrow schema 
serialized under the "ARROW:schema" key:
{code:python}
# includes both keys in the Parquet file
>>> pq.read_metadata("test_gdal.parquet").metadata
{b'geo': b'{"version":"0.1.0","...',
 b'ARROW:schema': b'/////3gBAAAQ...'}
# the "geo" key is lost in the Arrow schema
>>> pq.read_table("test_gdal.parquet").schema.metadata is None
True
{code}
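
One way the reading side could combine both sources, sketched in pure Python. This is an illustration under assumptions, not what Arrow currently does: the helper name {{merged_schema_metadata}} is hypothetical, and letting the metadata from the serialized schema win on duplicate keys is just one possible choice.

```python
from typing import Dict, Optional

ARROW_SCHEMA_KEY = b"ARROW:schema"


def merged_schema_metadata(
    file_kv: Optional[Dict[bytes, bytes]],
    embedded_schema_metadata: Optional[Dict[bytes, bytes]],
) -> Optional[Dict[bytes, bytes]]:
    """Merge file-level key/value metadata with the metadata found in
    the schema deserialized from "ARROW:schema".

    Hypothetical helper, not a pyarrow API.
    """
    # Start from the custom file-level keys, skipping the serialized schema.
    merged = {
        k: v for k, v in (file_kv or {}).items() if k != ARROW_SCHEMA_KEY
    }
    # Assumption: on duplicate keys, the embedded schema metadata wins.
    merged.update(embedded_schema_metadata or {})
    return merged or None
```

For the GDAL file above, this would keep the "geo" key. As a user-side workaround, the inputs could be taken from {{pq.read_metadata(path).metadata}} and {{table.schema.metadata}}, and the result applied with {{table.replace_schema_metadata(...)}}.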



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
