*My goal*

I have a list of numpy arrays of uneven length. From the docs, I guess the 
right format for this is ChunkedArray

I want to save my list to disk in one process, and then start many new 
processes (a pytorch dataloader) that are able to read chunks from the file 
with low memory overhead.

The current solution is to flatten the arrays into one, keep a list of the
lengths/offsets, store the flattened array in an `np.memmap`, and then have
each process slice into the memmap at the right offset (sketched below).
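
For concreteness, a minimal sketch of that baseline (the file names
`data.bin` / `offsets.npy` are just placeholders):

```python
import numpy as np

arrays = [np.random.rand(n).astype(np.float32) for n in (3, 5, 2)]

# Writer process: flatten and record where each entry starts.
flat = np.concatenate(arrays)
offsets = np.cumsum([0] + [len(a) for a in arrays])
flat.tofile("data.bin")
np.save("offsets.npy", offsets)

# Each reader process: memory-map the flat file and slice by offset.
offsets = np.load("offsets.npy")
mm = np.memmap("data.bin", dtype=np.float32, mode="r")
chunk = mm[offsets[1]:offsets[2]]  # second entry; pages in lazily
```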

It seems that with Arrow, we can at least drop the separate list of
lengths/offsets, since a list-typed array stores its offsets internally.
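
A sketch of what I mean (going through `.tolist()` just to keep the type
inference simple):

```python
import numpy as np
import pyarrow as pa

arrays = [np.random.rand(n) for n in (3, 5, 2)]

# A list-typed array keeps the offsets in its own buffer, so the
# separate lengths/offsets bookkeeping goes away.
list_arr = pa.array([a.tolist() for a in arrays])
chunked = pa.chunked_array([list_arr])
print(chunked.type)  # list<item: double>
print(chunked[1])    # the second (length-5) entry
```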

*What I have tried:*

Padding each entry in the list to a fixed length, and saving a pa.Table to a
pa.NativeFile. Each process reads its own pa.Table. This is slower and less
memory-efficient than the `np.memmap` approach by about 15%.
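
Roughly what that attempt looked like (a sketch, not my exact code;
`padded.arrow` is a placeholder name):

```python
import numpy as np
import pyarrow as pa

arrays = [np.random.rand(n) for n in (3, 5, 2)]
max_len = max(len(a) for a in arrays)

# Pad every entry to the same length and store them as one column.
padded = [np.pad(a, (0, max_len - len(a))).tolist() for a in arrays]
table = pa.table({"data": padded})

# Write with the IPC file (random-access) format.
with pa.OSFile("padded.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# Each dataloader worker opens its own handle and reads the table.
with pa.memory_map("padded.arrow", "r") as source:
    t = pa.ipc.open_file(source).read_all()
```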

*My questions:*

1) Are there any examples online that do this sort of operation? After a few
searches I can't find how to save a ChunkedArray to disk, or a Python Flight
example.
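
For reference, here's the shape of what I was imagining: wrap the
ChunkedArray in a one-column table and write it with the IPC file format.
But I don't know whether this is the intended way, or whether the
memory-mapped read stays zero-copy:

```python
import numpy as np
import pyarrow as pa

arrays = [np.random.rand(n) for n in (3, 5, 2)]
chunked = pa.chunked_array([pa.array([a.tolist() for a in arrays])])

# One writer process: wrap the chunked list array in a table and
# write it with the IPC file (random-access) format.
table = pa.table({"data": chunked})
with pa.OSFile("ragged.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# In each dataloader worker: memory-map the file and slice one entry.
with pa.memory_map("ragged.arrow", "r") as source:
    t = pa.ipc.open_file(source).read_all()
    entry = t.column("data")[1]  # variable-length, no padding needed
```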

2) Is it unreasonable to think this will use less memory than `np.memmap`?

Thanks in advance!

Sam
