Jia Yu created SEDONA-455:
-----------------------------

             Summary: Add a new data source namely geoparquet.metadata
                 Key: SEDONA-455
                 URL: https://issues.apache.org/jira/browse/SEDONA-455
             Project: Apache Sedona
          Issue Type: New Feature
            Reporter: Jia Yu


Can we add a new data source that reads only the file-level metadata of a Parquet 
file? This is crucial for entry-level users who need to explore an unfamiliar 
Parquet file, including GeoParquet. In the GeoParquet case, it would let users see 
the projjson value, since we are not able to properly parse it into a known EPSG code.
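
As a rough illustration of the information involved (not Sedona code): GeoParquet stores its file-level metadata as a JSON string under the {{geo}} key of the Parquet footer, with one entry per geometry column and an optional PROJJSON {{crs}} object. The sketch below assumes that string has already been read from the footer; the function name {{summarize_geo_metadata}} and the trimmed sample document are hypothetical and only meant to show what a metadata reader could surface.

```python
import json


def summarize_geo_metadata(geo_json):
    """Summarize each geometry column found in a GeoParquet 'geo' metadata string.

    Hypothetical helper: it assumes the JSON layout described by the
    GeoParquet specification (a top-level "columns" object whose entries may
    carry a PROJJSON "crs" object).
    """
    geo = json.loads(geo_json)
    rows = []
    for name, col in geo.get("columns", {}).items():
        crs = col.get("crs")
        crs = crs if isinstance(crs, dict) else None
        # PROJJSON may carry an identifier such as {"authority": "EPSG", "code": 4326}.
        crs_id = crs.get("id") if crs else None
        has_id = isinstance(crs_id, dict) and "authority" in crs_id and "code" in crs_id
        rows.append({
            "column": name,
            "encoding": col.get("encoding"),
            "crs_name": crs.get("name") if crs else None,
            "crs_id": "{}:{}".format(crs_id["authority"], crs_id["code"]) if has_id else None,
            "crs": crs,  # the raw projjson value, kept for the user to inspect
        })
    return rows


# Trimmed, illustrative sample; real PROJJSON objects carry many more fields.
SAMPLE_GEO = json.dumps({
    "version": "1.0.0",
    "primary_column": "geometry",
    "columns": {
        "geometry": {
            "encoding": "WKB",
            "geometry_types": ["Point"],
            "crs": {
                "type": "GeographicCRS",
                "name": "WGS 84",
                "id": {"authority": "EPSG", "code": 4326},
            },
        }
    },
})

rows = summarize_geo_metadata(SAMPLE_GEO)
```

When the PROJJSON has no {{id}} member, {{crs_id}} stays empty and the raw {{crs}} object is all the user has to go on, which is the situation described above.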

I understand that the only metadata a Spark DataFrame carries is its schema, 
which cannot hold this kind of information.

So I suggest that we add a new data source named {{geoparquet.metadata}}, 
which loads this metadata using 
[ParquetFileReader|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java].
One good example is from DuckDB: 
[duckdb.org/docs/data/parquet/metadata.html|https://duckdb.org/docs/data/parquet/metadata.html]
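
To show why a metadata-only reader can be cheap, here is a minimal standard-library sketch of locating the footer of a plain (unencrypted) Parquet file: the file ends with the Thrift-encoded file metadata, then a 4-byte little-endian footer length, then the magic bytes {{PAR1}}. The helper name {{read_footer_bytes}} is hypothetical, and decoding the Thrift payload (where the key-value metadata lives) is deliberately left to a real reader such as ParquetFileReader.

```python
import io
import struct

PARQUET_MAGIC = b"PAR1"
TAIL_SIZE = 8  # 4-byte little-endian footer length + 4-byte magic


def read_footer_bytes(f):
    """Return the raw, still Thrift-encoded footer from a seekable binary file.

    Hypothetical helper: only the tail of the file is read, never the row groups.
    """
    f.seek(0, io.SEEK_END)
    file_size = f.tell()
    if file_size < len(PARQUET_MAGIC) + TAIL_SIZE:
        raise ValueError("too small to be a Parquet file")

    f.seek(file_size - TAIL_SIZE)
    tail = f.read(TAIL_SIZE)
    if tail[4:] != PARQUET_MAGIC:
        raise ValueError("missing PAR1 trailer")

    (footer_len,) = struct.unpack("<I", tail[:4])
    footer_start = file_size - TAIL_SIZE - footer_len
    if footer_start < len(PARQUET_MAGIC):
        raise ValueError("footer length exceeds file size")

    f.seek(footer_start)
    return f.read(footer_len)


# Demo on a fake in-memory file image; the payload is NOT valid Thrift.
fake_footer = b"not-real-thrift"
fake_file = io.BytesIO(
    PARQUET_MAGIC
    + b"row-group-bytes"
    + fake_footer
    + struct.pack("<I", len(fake_footer))
    + PARQUET_MAGIC
)
footer = read_footer_bytes(fake_file)
```

Because only the last few bytes and the footer itself are read, the cost does not depend on how much row data the file holds.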



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
