Thank you. This is a great change. My confusion was mainly around the
concept of "chunks": I had assumed it meant chunks of memory in bytes, but
it looks like in Arrow, chunk sizes are counted in rows per batch, not bytes.
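
For example (a minimal sketch with made-up data), max_chunksize splits a
Table by row count:

```
import pyarrow as pa

# A single-chunk table with 10 rows.
tbl = pa.table({"n_legs": list(range(10)), "animals": ["Dog"] * 10})

# max_chunksize is a row count, not a byte count: at most 4 rows per batch.
batches = tbl.to_batches(max_chunksize=4)
print([b.num_rows for b in batches])
# expected: [4, 4, 2]
```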

On Mon, Feb 26, 2024 at 1:05 PM Bryce Mecum <[email protected]> wrote:

> I filed a minor PR [1] to improve the documentation so it's clear what
> units are involved as I think the current language is vague.
>
> [1] https://github.com/apache/arrow/pull/40251
>
> On Sun, Feb 25, 2024 at 9:08 PM Kevin Liu <[email protected]> wrote:
> >
> > Hey folks,
> >
> > I'm working with the PyArrow API for Tables and RecordBatches, and I'm
> trying to split a Table into a list of RecordBatches, each with a target
> chunk size in bytes. For example, a 10 GB Table into several 512 MB chunks.
> >
> > I'm having a hard time doing this using the existing API. The
> Table.to_batches method has an optional parameter `max_chunksize` which is
> documented as "Maximum size for RecordBatch chunks. Individual chunks may
> be smaller depending on the chunk layout of individual columns." It seems
> like exactly what I want, but I've run into a couple of edge cases.
> >
> > Edge case 1, Table created from many RecordBatches
> > ```
> > import pyarrow as pa
> >
> > pylist = [{'n_legs': 2, 'animals': 'Flamingo'},
> >           {'n_legs': 4, 'animals': 'Dog'}]
> > pylist_tbl = pa.Table.from_pylist(pylist)
> > pylist_tbl.nbytes
> > # > 35
> > multiplier = 2048
> > bigger_pylist_tbl = pa.Table.from_batches(pylist_tbl.to_batches() * multiplier)
> > bigger_pylist_tbl.nbytes
> > # > 591872 / 578.00 KB
> >
> > target_batch_size = 512 * 1024 * 1024  # 512 MB
> > len(bigger_pylist_tbl.to_batches(target_batch_size))
> > # > 2048
> > # expected 1 RecordBatch, since the whole table is well under 512 MB
> > ```
> >
> > Edge case 2, really big Table with a single RecordBatch
> > ```
> > import pyarrow as pa
> >
> > # file already saved on disk
> > with pa.memory_map('table_10000000.arrow', 'r') as source:
> >     huge_arrow_tbl = pa.ipc.open_file(source).read_all()
> >
> > huge_arrow_tbl.nbytes
> > # > 7188263146 / 6.69 GB
> > len(huge_arrow_tbl)
> > # > 10_000_000
> >
> > target_batch_size = 512 * 1024 * 1024  # 512 MB
> > len(huge_arrow_tbl.to_batches(target_batch_size))
> > # > 1
> > # expected (6.69 GB // 512 MB) + 1 = 14 RecordBatches
> > ```
> >
> > I'm currently exploring the underlying implementation of to_batches and
> TableBatchReader::ReadNext.
> > Please let me know if anyone knows a canonical way to achieve the
> chunking behavior described above.
> >
> > Thanks,
> > Kevin
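
One workaround for the byte-based splitting asked about above is to convert
the byte target into a row count first. This is only a rough sketch, not an
official API (the helper name is mine; it assumes roughly uniform row widths
and that combine_chunks() is acceptable memory-wise):

```
import math
import pyarrow as pa

def to_batches_by_bytes(tbl: pa.Table, target_bytes: int):
    # Approximate byte-based chunking by turning the byte target into a
    # row count based on the table's average row width.
    if tbl.num_rows == 0:
        return tbl.to_batches()
    avg_row_bytes = tbl.nbytes / tbl.num_rows
    rows_per_batch = max(1, math.floor(target_bytes / avg_row_bytes))
    # combine_chunks() first so pre-existing tiny chunks (edge case 1)
    # don't cap each batch at the original chunk boundaries.
    return tbl.combine_chunks().to_batches(max_chunksize=rows_per_batch)

# e.g. to_batches_by_bytes(huge_arrow_tbl, 512 * 1024 * 1024) should give
# batches of roughly 512 MB each.
```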
