Re: Reading Parquet Files in Chunks?

Wes McKinney Mon, 09 Dec 2019 03:40:47 -0800

There is, but it's not exposed in Python yet.

See the "batch_size" parameter of ArrowReaderProperties

https://github.com/apache/arrow/blob/master/cpp/src/parquet/properties.h#L565

and the GetRecordBatchReader method on parquet::arrow::FileReader.
There's some related work happening in the C++ Datasets project.

I'd like to see batch-based reading refined and better documented in
both C++ and Python. This would be a nice project for a volunteer to
take on.

- Wes

On Sun, Dec 8, 2019 at 9:00 PM Zhuo Jia Dai <[email protected]> wrote:
>
>
> For example, pandas's read_csv has a chunksize argument which makes
> read_csv return an iterator over the CSV file so we can read it in chunks.
>
> The Parquet format stores the data in chunks, but there isn't a documented
> way to read it in chunks like read_csv.
>
> Is there a way to read parquet files in chunks?
>
> --
> ZJ
>
> [email protected]
