For reference, this was also opened as an issue on GitHub, and I answered there: https://github.com/apache/arrow/issues/35126
On Wed, 12 Apr 2023 at 06:34, Jiajun Yao <[email protected]> wrote:
>
> When the pyarrow table has many chunks, the slice function is slow, as
> demonstrated with the following code:
>
> ```
> import sys
> import time
> import numpy as np
> import pyarrow as pa
>
> batch_size = 1024
>
> batches = []
> for _ in range(8555):
>     batch = {}
>     for i in range(10):
>         batch[str(i)] = np.array([j for j in range(batch_size)])
>     batches.append(pa.Table.from_pydict(batch))
> block = pa.concat_tables(batches, promote=True)
>
> # Without the below line, the time is 345s and with it, the time is 0.07s.
> # block = block.combine_chunks()
>
> start = time.perf_counter()
> while block.num_rows > batch_size:
>     block.slice(0, batch_size)
>     block = block.slice(batch_size, block.num_rows - batch_size)
>
> duration = time.perf_counter() - start
> print(f"Duration: {duration}")
> ```
>
> Several questions:
>
> Is this slice slowness expected when a table has many chunks?
> Is there a way to tell pyarrow.concat_tables to return a table with a
> single chunk so I can avoid an extra copy by calling combine_chunks()?
>
>
> --
> Thanks,
> Jiajun Yao
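As a quick sketch for the archive (the full discussion is in the linked issue): combine_chunks() merges each column into a single contiguous chunk up front, which is what the commented-out line in the reproduction already does, and Table.to_batches(max_chunksize=...) can split the table in one pass without looping over slice() at all. The snippet below illustrates both; it reuses the table construction from the report and I have not re-run the timings quoted above.

```
import numpy as np
import pyarrow as pa

batch_size = 1024

# Same heavily chunked table as in the reproduction above:
# ~8555 tables of 1024 rows concatenated without rechunking.
batches = [
    pa.Table.from_pydict({str(i): np.arange(batch_size) for i in range(10)})
    for _ in range(8555)
]
block = pa.concat_tables(batches, promote=True)

# Workaround 1 (already hinted at in the report): pay for one copy up front
# so every column is a single contiguous chunk, then slice cheaply.
combined = block.combine_chunks()
first = combined.slice(0, batch_size)

# Workaround 2 (untested suggestion): let Arrow do the splitting in one pass
# instead of slicing in a Python loop. Individual batches can be smaller
# than max_chunksize, depending on the existing chunk layout.
for batch in block.to_batches(max_chunksize=batch_size):
    pass  # each `batch` is a pyarrow.RecordBatch of at most batch_size rows
```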
