There is a row_groups parameter to iter_batches where you can pass in a list of row group indices to read. Would this help to read the Parquet file in parallel?
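Something along these lines might work (an untested sketch only; the file path, worker count, and the round-robin split of row groups are placeholders, not taken from your setup):

from concurrent.futures import ThreadPoolExecutor
import pyarrow.parquet as pq

PATH = "example.parquet"   # placeholder path
NUM_WORKERS = 8            # placeholder worker count

num_row_groups = pq.ParquetFile(PATH).metadata.num_row_groups

def read_row_groups(row_group_indices):
    # Each worker opens its own ParquetFile, so no reader object is
    # shared between threads.
    pf = pq.ParquetFile(PATH, memory_map=True)
    batches = []
    for batch in pf.iter_batches(row_groups=row_group_indices,
                                 use_pandas_metadata=True):
        batches.append(batch)
    return batches

# Split the row groups round-robin across the workers.
chunks = [list(range(i, num_row_groups, NUM_WORKERS))
          for i in range(NUM_WORKERS)]

with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
    for worker_batches in pool.map(read_row_groups, chunks):
        ...  # hand the RecordBatches to the DGLDataset workers here

Since each thread opens its own ParquetFile, nothing is shared between workers and the thread safety of the reader itself stops being an issue. Whether threads or separate processes give the better speedup will depend on your workload.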
On Thu, Feb 3, 2022 at 8:31 AM Cindy McMullen <[email protected]> wrote:

> Hi -
>
> I'd like to ingest batches within a Parquet file in parallel. The client
> (DGLDataset) needs to be thread-safe. What's the best API for me to use
> to do so?
>
> Here's the metadata for one example file:
>
> <pyarrow._parquet.FileMetaData object at 0x7fbb05c64050>
>   created_by: parquet-mr version 1.12.0 (build db75a6815f2ba1d1ee89d1a90aeb296f1f3a8f20)
>   num_columns: 4
>   num_rows: 1000000
>   num_row_groups: 9997
>   format_version: 1.0
>   serialized_size: 17824741
>
> I want the consumption of batches to be distributed among multiple
> workers. I'm currently trying something like this:
>
> # Once per client
> pqf = pq.ParquetFile(f, memory_map=True)
>
> # Ideally, each worker can do this, but ParquetFile.iter_batches is not
> # thread-safe. This makes intuitive sense.
> pq_batches = pqf.iter_batches(self.rows_per_batch, use_pandas_metadata=True)
>
> My workaround is to buffer these ParquetFile batches into a list of
> DataFrames, but this is memory-intensive, so it will not scale to multiple
> of these input files.
>
> What's a better PyArrow pattern to use so I can distribute batches to my
> workers in a thread-safe manner?
>
> Thanks --

--
Partha Dutta
[email protected]
