Hi -

I need to drop down to the ParquetFile API so that I can better control the
batch size when reading huge Parquet files. The filename is:

    gs://graph_infra_steel_thread/output_pq/parquet/usersims/output-20211202-220329-20211202-220329-00-0012.parquet.snappy
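
For context, the loop I am hoping to run looks roughly like this (assuming
iter_batches is the right method for controlling batch size; process() is
just a placeholder for my own code):

    import pyarrow.parquet as pq

    pqf = pq.ParquetFile(filename)
    # Read the file incrementally, in record batches of bounded size
    for batch in pqf.iter_batches(batch_size=64_000):
        process(batch)  # placeholder for my processing step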

This invocation fails:

    pqf = pq.ParquetFile(filename)

    FileNotFoundError: [Errno 2] Failed to open local file
    'gs://graph_infra_steel_thread/output_pq/parquet/usersims/output-20211202-220329-20211202-220329-00-0012.parquet.snappy'.
    Detail: [errno 2] No such file or directory

Meanwhile this call, with the same filename, succeeds, because read_table
lets me pass the 'gs' filesystem explicitly:

    table = pq.read_table(filename, filesystem=gs, use_legacy_dataset=False)
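
For reference, gs here is a gcsfs filesystem object, created along these
lines (this is my local setup, so take the details as an assumption):

    import gcsfs

    # Authenticates with the default Google credentials on this machine
    gs = gcsfs.GCSFileSystem()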

I don't see a way to specify 'filesystem' on the ParquetFile API
<https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetFile.html#pyarrow.parquet.ParquetFile>.
Is there any way to read a GCS file using ParquetFile?
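
The closest thing I could piece together from the docs is handing
ParquetFile an already-open file object instead of a path, roughly like
this (a guess on my part, so I would like to confirm the supported
pattern):

    # Open the blob through gcsfs and pass the file object to ParquetFile
    with gs.open(filename, 'rb') as f:
        pqf = pq.ParquetFile(f)
        for batch in pqf.iter_batches(batch_size=64_000):
            process(batch)  # placeholder for my processing step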

If not, can you show me the code for reading batches using pq.read_table or
one of the other Arrow Parquet APIs
<https://arrow.apache.org/docs/python/api/formats.html#parquet-files>?
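
In case it is useful, this is the direction I was exploring with the
dataset API; the batch_size argument here is my reading of the Scanner
docs, so it may be off:

    import pyarrow.dataset as ds

    dataset = ds.dataset(filename, filesystem=gs, format='parquet')
    # Stream record batches without materializing the whole table
    for batch in dataset.to_batches(batch_size=64_000):
        process(batch)  # placeholder for my processing step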

Thanks -

-- Cindy
