Hi Kevin,

I'm not an Arrow dev, so take everything I say with a grain of salt. I just 
wanted to point out that `max_chunksize` appears to refer to the maximum number 
of *rows* per batch rather than the number of *bytes* per batch: 
https://github.com/apache/arrow/blob/b8fff043c6cb351b1fad87fa0eeaf8dbc550e37c/cpp/src/arrow/table.cc#L647-L649.
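
You can see the row-count behavior with a tiny table. A minimal sketch (nothing 
here depends on byte sizes, which is the point):

```
import pyarrow as pa

# 10 rows in a single chunk; max_chunksize caps *rows* per batch, not bytes
tbl = pa.table({'x': list(range(10))})
batches = tbl.to_batches(max_chunksize=3)
print([b.num_rows for b in batches])
# > [3, 3, 3, 1]
```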

Additionally, `Table.to_batches()` is documented as being zero-copy, and, as you 
quoted, "Individual chunks may be smaller depending on the chunk layout of 
individual columns". In other words, the method will not copy data in an effort 
to make RecordBatches of more uniform size.
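
A quick illustration of that, as a sketch (note that `Table.combine_chunks()` 
consolidates the chunks, but it does copy data to do so):

```
import pyarrow as pa

batch = pa.RecordBatch.from_pydict({'x': [1, 2]})
tbl = pa.Table.from_batches([batch] * 4)   # 4 chunks of 2 rows each

# the existing chunk layout is preserved, even with a huge row cap
print(len(tbl.to_batches(max_chunksize=1_000_000)))
# > 4

# combine_chunks() concatenates each column into one chunk (this copies)
print(len(tbl.combine_chunks().to_batches(max_chunksize=1_000_000)))
# > 1
```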

In light of this information:


* In Case #1: `pylist_tbl.to_batches() * multiplier` creates 2048 
RecordBatches, and the Table becomes a zero-copy wrapper on top of the data for 
each of these RecordBatches. Each ChunkedArray in the Table will therefore 
contain 2048 chunks/arrays. When this Table is converted via `to_batches()`, by 
its zero-copy nature the chunks will not be concatenated, so 
`bigger_pylist_tbl.to_batches()` will yield at least 2048 batches.


* In Case #2: `max_chunksize > len(huge_arrow_tbl)`, and you mention that the 
Table only has a single RecordBatch, so there is nothing to chunk up (see the 
short sketch just after this list).
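
Here is a small self-contained stand-in for Case #2 (a toy table rather than 
your 6.69 GB one; it just shows that a row cap far above the row count leaves a 
single-chunk Table as one batch):

```
import pyarrow as pa

target_batch_size = 512 * 1024 * 1024   # 536_870_912, interpreted as a ROW cap
single_chunk_tbl = pa.table({'x': list(range(1_000))})  # stand-in for huge_arrow_tbl

print(len(single_chunk_tbl.to_batches(max_chunksize=target_batch_size)))
# > 1  -- the cap exceeds the row count, so the single chunk comes back as-is
```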

I'm not sufficiently familiar with the Arrow APIs to know of a way to chunk by 
a target number of bytes, so I'll let others chime in on that.
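
That said, if an approximate split is good enough, one rough workaround is to 
translate the byte target into a row target using the table's average row 
width. This is just my own sketch (the `split_by_target_bytes` name is made up, 
and the estimate assumes reasonably uniform row widths), not a byte-exact Arrow 
API:

```
import pyarrow as pa

def split_by_target_bytes(tbl: pa.Table, target_bytes: int) -> list[pa.RecordBatch]:
    """Approximate byte-based chunking by turning a byte target into a row target."""
    avg_row_bytes = max(1, tbl.nbytes // max(1, tbl.num_rows))
    rows_per_batch = max(1, target_bytes // avg_row_bytes)
    # combine_chunks() concatenates each column (copying data), so the existing
    # chunk layout no longer dictates the batch boundaries
    return tbl.combine_chunks().to_batches(max_chunksize=rows_per_batch)
```

For your second example this should come out to roughly (6.69 GB // 512 MB) + 1 
batches, though variable-width columns can skew the estimate and the 
combine_chunks() copy may be expensive on a table that large, so I'd try it on 
something smaller first.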

Best,
Shoumyo

From: [email protected] At: 02/26/24 01:08:41 UTC-5:00To:  
[email protected]
Subject: Chunk Table into RecordBatches of at most 512MB each

Hey folks, 

I'm working with the PyArrow API for Tables and RecordBatches, and I'm trying 
to chunk a Table into a list of RecordBatches, each at most a given size in 
bytes. For example, splitting 10 GB into several 512 MB chunks.

I'm having a hard time doing this with the existing API. The `Table.to_batches` 
method has an optional parameter `max_chunksize`, which is documented as 
"Maximum size for RecordBatch chunks. Individual chunks may be smaller 
depending on the chunk layout of individual columns." This seems like exactly 
what I want, but I've run into a couple of edge cases.

Edge case 1: Table created from many RecordBatches
```
import pyarrow as pa

pylist = [{'n_legs': 2, 'animals': 'Flamingo'},
          {'n_legs': 4, 'animals': 'Dog'}]
pylist_tbl = pa.Table.from_pylist(pylist)
# pylist_tbl.nbytes
# > 35
multiplier = 2048
bigger_pylist_tbl = pa.Table.from_batches(pylist_tbl.to_batches() * multiplier)
# bigger_pylist_tbl.nbytes
# 591872 / 578.00 KB

target_batch_size = 512 * 1024 * 1024  # 512 MB
len(bigger_pylist_tbl.to_batches(target_batch_size))
# > 2048
# expected, 1 RecordBatch
```

Edge case 2: really big Table with a single RecordBatch
```
# file already saved on disk
with pa.memory_map('table_10000000.arrow', 'r') as source:
    huge_arrow_tbl = pa.ipc.open_file(source).read_all()

huge_arrow_tbl.nbytes
# 7188263146 / 6.69 GB
len(huge_arrow_tbl)
# 10_000_000

target_batch_size = 512 * 1024 * 1024  # 512 MB
len(huge_arrow_tbl.to_batches(target_batch_size))
# > 1
# expected (6.69 GB // 512 MB) + 1 RecordBatches
```

I'm currently exploring the underlying implementation of `to_batches` and 
`TableBatchReader::ReadNext`.
Please let me know if anyone knows of a canonical way to get the chunking 
behavior described above.

Thanks,
Kevin

