Hi Kevin,

I'm not an Arrow dev, so take everything I say with a grain of salt. I just wanted to point out that `max_chunksize` appears to refer to the maximum number of *rows* per batch rather than the number of *bytes* per batch: https://github.com/apache/arrow/blob/b8fff043c6cb351b1fad87fa0eeaf8dbc550e37c/cpp/src/arrow/table.cc#L647-L649.
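For example (a minimal sketch just to illustrate the row-based behavior; the tiny three-row table below is made up for illustration), I would expect batches of 2 and 1 rows here, regardless of how many bytes they hold:

```
import pyarrow as pa

# Hypothetical three-row table, purely for illustration.
tbl = pa.table({'n_legs': [2, 4, 5],
                'animals': ['Flamingo', 'Dog', 'Brittle stars']})

# max_chunksize counts rows, not bytes: at most 2 rows per batch.
batches = tbl.to_batches(max_chunksize=2)
print([b.num_rows for b in batches])  # expected: [2, 1]
```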
Additionally, `Table.to_batches()` is documented as being zero-copy, and, as you quoted, "Individual chunks may be smaller depending on the chunk layout of individual columns." In other words, the method will not copy data just to produce RecordBatches of more uniform size. In light of this:

* In Case #1: `pylist_tbl.to_batches() * multiplier` creates 2048 RecordBatches, and the Table becomes a zero-copy wrapper on top of the data of each of these RecordBatches. Each ChunkedArray in the Table will contain 2048 chunks/arrays. When this Table is converted back via `to_batches()`, the chunks will not be concatenated, by its zero-copy nature, so `bigger_pylist_tbl.to_batches()` will yield at least 2048 batches.
* In Case #2: `max_chunksize > len(huge_arrow_tbl)`, and you mention that the Table only has a single RecordBatch, so there is nothing to chunk up.

I'm not sufficiently familiar with the Arrow APIs to know of a way to chunk by a target number of bytes, so I'll let others chime in on that; a couple of rough workarounds are sketched after the quoted message below.

Best,
Shoumyo

From: [email protected] At: 02/26/24 01:08:41 UTC-5:00
To: [email protected]
Subject: Chunk Table into RecordBatches of at most 512MB each

Hey folks,

I'm working with the PyArrow API for Tables and RecordBatches, and I'm trying to chunk a Table into a list of RecordBatches, each at most a target chunk size; for example, splitting 10 GB into several 512 MB chunks. I'm having a hard time doing this with the existing API.

The `Table.to_batches` method has an optional parameter `max_chunksize`, which is documented as "Maximum size for RecordBatch chunks. Individual chunks may be smaller depending on the chunk layout of individual columns." It seems like exactly what I want, but I've run into a couple of edge cases.

Edge case 1: Table created from many RecordBatches

```
import pyarrow as pa

pylist = [{'n_legs': 2, 'animals': 'Flamingo'},
          {'n_legs': 4, 'animals': 'Dog'}]
pylist_tbl = pa.Table.from_pylist(pylist)
# pylist_tbl.nbytes
# > 35

multiplier = 2048
bigger_pylist_tbl = pa.Table.from_batches(pylist_tbl.to_batches() * multiplier)
# bigger_pylist_tbl.nbytes
# > 591872 / 578.00 KB

target_batch_size = 512 * 1024 * 1024  # 512 MB
len(bigger_pylist_tbl.to_batches(target_batch_size))
# > 2048
# expected: 1 RecordBatch
```

Edge case 2: really big Table with a single RecordBatch

```
# file already saved on disk
with pa.memory_map('table_10000000.arrow', 'r') as source:
    huge_arrow_tbl = pa.ipc.open_file(source).read_all()

huge_arrow_tbl.nbytes
# > 7188263146 / 6.69 GB
len(huge_arrow_tbl)
# > 10_000_000

target_batch_size = 512 * 1024 * 1024  # 512 MB
len(huge_arrow_tbl.to_batches(target_batch_size))
# > 1
# expected: (6.69 GB // 512 MB) + 1 RecordBatches
```

I'm currently exploring the underlying implementation of `to_batches` and `TableBatchReader::ReadNext`. Please let me know if anyone knows a canonical way to get the chunking behavior described above.

Thanks,
Kevin
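Regarding Case #1: if the goal there is a single RecordBatch, one rough workaround is `Table.combine_chunks()`, which concatenates the chunks of each column at the cost of a copy; a plain `to_batches()` on the result should then return one batch per (now single) chunk. A minimal sketch, reusing `bigger_pylist_tbl` from the quoted message:

```
# Copies the data so that each column ends up as a single chunk,
# giving up the zero-copy behavior described above.
single_chunk_tbl = bigger_pylist_tbl.combine_chunks()

len(single_chunk_tbl.to_batches())
# > 1
```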

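For the byte-based target more generally, a rough approximation (only a sketch, not a canonical API: it assumes row sizes are roughly uniform, and the helper name `to_batches_by_bytes` is made up here) is to derive a row count from `Table.nbytes` and pass it as `max_chunksize`:

```
import pyarrow as pa

def to_batches_by_bytes(table: pa.Table, target_bytes: int = 512 * 1024 * 1024):
    """Split a Table into RecordBatches of roughly target_bytes each.

    Rough sketch only: it assumes row sizes are roughly uniform, so
    batches of variable-width data may deviate from the target.
    """
    if table.num_rows == 0 or table.nbytes <= target_bytes:
        return table.to_batches()

    # Estimate how many rows fit into the byte budget.
    avg_row_bytes = table.nbytes / table.num_rows
    rows_per_batch = max(1, int(target_bytes // avg_row_bytes))

    # combine_chunks() copies the data into one chunk per column, so
    # to_batches() is no longer limited by the original chunk layout.
    return table.combine_chunks().to_batches(max_chunksize=rows_per_batch)
```

On the 6.69 GB table from Case #2, `to_batches_by_bytes(huge_arrow_tbl)` should then yield somewhere around 14 batches of roughly 512 MB each, at the cost of the copy in `combine_chunks()`.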