There is a row_groups parameter on iter_batches where you can pass a list of
row-group indices. Would this help to read the Parquet file in parallel?
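
For example, each worker could open its own ParquetFile handle and iterate
only over its assigned row groups, so no iterator is shared between threads.
A rough, untested sketch (the file path, worker count, and the process()
callback are placeholders, not part of your setup):

import pyarrow.parquet as pq
from concurrent.futures import ThreadPoolExecutor

PATH = "example.parquet"   # placeholder path
NUM_WORKERS = 4            # placeholder worker count
BATCH_SIZE = 1000

def read_row_groups(row_group_ids):
    # Each worker opens its own ParquetFile, so nothing is shared across threads.
    pf = pq.ParquetFile(PATH, memory_map=True)
    for batch in pf.iter_batches(batch_size=BATCH_SIZE,
                                 row_groups=row_group_ids,
                                 use_pandas_metadata=True):
        process(batch)     # placeholder per-batch handler

num_row_groups = pq.ParquetFile(PATH).metadata.num_row_groups
# Stripe the row-group indices across the workers.
chunks = [list(range(i, num_row_groups, NUM_WORKERS)) for i in range(NUM_WORKERS)]

with ThreadPoolExecutor(max_workers=NUM_WORKERS) as ex:
    list(ex.map(read_row_groups, chunks))

With ~10k row groups for 1M rows, you may also want to batch several row
groups per call to keep the per-call overhead down.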

On Thu, Feb 3, 2022 at 8:31 AM Cindy McMullen <[email protected]> wrote:

> Hi -
>
> I'd like to ingest batches within a Parquet file in parallel.  The client
> (DGLDataset) needs to be thread-safe.  What's the best API for me to use
> to do so?
>
> Here's the metadata for one example file:
>
>   <pyarrow._parquet.FileMetaData object at 0x7fbb05c64050>
>   created_by: parquet-mr version 1.12.0 (build 
> db75a6815f2ba1d1ee89d1a90aeb296f1f3a8f20)
>   num_columns: 4
>   num_rows: 1000000
>   num_row_groups: 9997
>   format_version: 1.0
>   serialized_size: 17824741
>
> I want the consumption of batches to be distributed among multiple
> workers.  I'm currently trying something like this:
>
> # Once per client
>
> pqf = pq.ParquetFile(f, memory_map=True)
>
>
> # Ideally, each worker can do this, but ParquetFile.iter_batches is not
> # thread-safe.  This makes intuitive sense.
> pq_batches = pqf.iter_batches(self.rows_per_batch, use_pandas_metadata=True)
>
>
>
> My workaround is to buffer these ParquetFile batches into a list of
> DataFrames, but this is memory-intensive, so it will not scale to multiple
> input files.
>
> What's a better PyArrow pattern to use so I can distribute batches to my
> workers in a thread-safe manner?
>
> Thanks --

-- 
Partha Dutta
[email protected]
