Re: [I] [C++][Parquet][Python] New API to 'zip' or (vertically) 'attach' parquet metadata [arrow]

2024-04-13 Thread via GitHub


wgtmac commented on issue #40958:
URL: https://github.com/apache/arrow/issues/40958#issuecomment-2053670644

   Awkward. I didn't even notice that this is already supported. Thanks for 
pointing it out!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [C++][Parquet][Python] New API to 'zip' or (vertically) 'attach' parquet metadata [arrow]

2024-04-12 Thread via GitHub


mrbrahman commented on issue #40958:
URL: https://github.com/apache/arrow/issues/40958#issuecomment-2052977688

   @wgtmac, sorry, I didn't quite understand this:
   
   > it still needs some refactoring work to support the metadata file
   
   pyarrow already supports the metadata file, right?
   
   
https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-metadata-files
   
   





Re: [I] [C++][Parquet][Python] New API to 'zip' or (vertically) 'attach' parquet metadata [arrow]

2024-04-12 Thread via GitHub


wgtmac commented on issue #40958:
URL: https://github.com/apache/arrow/issues/40958#issuecomment-2051969756

   Before talking about the `zip` API, it still needs some refactoring work to 
support the metadata file. What is your use case then? One possible use case 
for the metadata file is to combine Parquet files holding different columns 
into a larger logical Parquet file. From my perspective, if the metadata file 
is not widely used, implementing this does not seem worth the effort.





Re: [I] [C++][Parquet][Python] New API to 'zip' or (vertically) 'attach' parquet metadata [arrow]

2024-04-11 Thread via GitHub


mrbrahman commented on issue #40958:
URL: https://github.com/apache/arrow/issues/40958#issuecomment-2050501230

   @wgtmac, no, I don't think the _metadata file is widely used in big data 
systems like Hadoop/Spark. However, Apache Arrow does seem to have the 
required API (in ParquetFile) to read metadata separately from the data.
   
   Of course, I'm also not sure whether Apache Arrow will specifically support 
the section below from my request (because we currently have no way to stitch 2 
metadata files): 
   
   > Once this is done, a combined dataset can be created using:
   > ~~~python
   > m = pq.read_metadata('_metadata')
   > data = pq.ParquetFile('file1.parquet', 'file2.parquet', metadata=m)
   > 
   > # data should now be able to show all columns
   > ~~~
   
   where file1.parquet contains col1, col2, col3 and file2.parquet contains col4 
and col5 (a different set of columns). Here only the _metadata file has the 
overarching information about the 'table' definition.
   
   I'm only guessing that it would be supported, since it has the API to do so. 
However, it would be nice to confirm that as well.
   
   





Re: [I] [C++][Parquet][Python] New API to 'zip' or (vertically) 'attach' parquet metadata [arrow]

2024-04-09 Thread via GitHub


wgtmac commented on issue #40958:
URL: https://github.com/apache/arrow/issues/40958#issuecomment-2046443096

   I think this is the Parquet summary metadata file. See 
`parquet.summary.metadata.level` in 
https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/README.md#class-parquetoutputformat.
 However, I don't know whether it is widely used.





Re: [I] [C++][Parquet][Python] New API to 'zip' or (vertically) 'attach' parquet metadata [arrow]

2024-04-09 Thread via GitHub


mapleFU commented on issue #40958:
URL: https://github.com/apache/arrow/issues/40958#issuecomment-2045737817

   @wgtmac does the spec currently allow this?

