Re: [I] [C++][Parquet][Python] New API to 'zip' or (vertically) 'attach' parquet metadata [arrow]
wgtmac commented on issue #40958:
URL: https://github.com/apache/arrow/issues/40958#issuecomment-2053670644

Awkward. I didn't even notice that this is already supported. Thanks for pointing it out!

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [C++][Parquet][Python] New API to 'zip' or (vertically) 'attach' parquet metadata [arrow]
mrbrahman commented on issue #40958:
URL: https://github.com/apache/arrow/issues/40958#issuecomment-2052977688

@wgtmac, sorry, I didn't quite understand this:

> it still needs some refactoring work to support the metadata file

pyarrow already supports the metadata file, right? https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-metadata-files
Re: [I] [C++][Parquet][Python] New API to 'zip' or (vertically) 'attach' parquet metadata [arrow]
wgtmac commented on issue #40958:
URL: https://github.com/apache/arrow/issues/40958#issuecomment-2051969756

Before talking about the `zip` API, it still needs some refactoring work to support the metadata file. What is your use case? One possible use case for the metadata file is to combine Parquet files with different columns into a larger logical Parquet file. From my perspective, if the metadata file is not widely used, it does not seem worth the effort to implement this.
Re: [I] [C++][Parquet][Python] New API to 'zip' or (vertically) 'attach' parquet metadata [arrow]
mrbrahman commented on issue #40958:
URL: https://github.com/apache/arrow/issues/40958#issuecomment-2050501230

@wgtmac, no, I don't think the _metadata file is widely used in big data systems such as Hadoop or Spark. However, Apache Arrow does seem to have the required API (in `ParquetFile`) to read metadata separately from the data.

Of course, I'm not sure whether Apache Arrow would specifically support the following part of my request, because we currently have no way to stitch two metadata files together:

> Once this is done, the combined data can be created using:
> ~~~python
> m = pq.read_metadata('_metadata')
> data = pq.ParquetFile('file1.parquet', 'file2.parquet', metadata=m)
>
> # data should now be able to show all columns
> ~~~

where file1.parquet contains col1, col2 and col3, and file2.parquet contains col4 and col5 (a different set of columns). Here only the _metadata file has the overarching information about the 'table' definition.

I'm only guessing that this would be supported, since the API to do so exists. However, it would be nice to confirm that as well.
Re: [I] [C++][Parquet][Python] New API to 'zip' or (vertically) 'attach' parquet metadata [arrow]
wgtmac commented on issue #40958:
URL: https://github.com/apache/arrow/issues/40958#issuecomment-2046443096

I think this is the Parquet summary metadata file. See `parquet.summary.metadata.level` in https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/README.md#class-parquetoutputformat. However, I don't know whether it is widely used.
Re: [I] [C++][Parquet][Python] New API to 'zip' or (vertically) 'attach' parquet metadata [arrow]
mapleFU commented on issue #40958:
URL: https://github.com/apache/arrow/issues/40958#issuecomment-2045737817

@wgtmac does the spec currently allow this?