On the "This is slower and less memory efficient than `memmap` by about 15%." -- if you can show us more precisely what code you have written that will help us advise you. In principle if you are using pyarrow.memory_map the performance / memory use shouldn't be significantly different
On Wed, Feb 17, 2021 at 9:57 PM Micah Kornfield <[email protected]> wrote:

> Hi Sam,
> Could you elaborate on what advantages you were hoping to benefit from
> Arrow? It seems like the process you describe is probably close to optimal
> (I have limited knowledge of np.memmap), and there could be alternative
> suggestions based on the exact shape of your data and how you want to
> process it. I added some more comments inline below.
>
>> The current solution is to flatten the array, keep a list of the
>> lengths/offsets, store the flattened array in `np.memmap`, then have each
>> process slice into the memmap at the right index.
>> It seems that with Arrow, we can at least delete the list of
>> lengths/offsets.
>
> In Arrow it seems like the natural fit here is to use a ListArray wrapped
> around the numpy arrays. This would add back in the indices/offsets.
>
>> Padding each entry in the list to a fixed length, and saving the pa.Table to a
>> pa.NativeFile. Each process reads its own pa.Table. This is slower and
>> less memory efficient than `memmap` by about 15%.
>
> How are you reading back the file? Are you using MemoryMappedFile [1]?
>
>> 1) Are there any examples online that do this sort of operation? I can't
>> find how to save a chunked array to disk, or a Python Flight example, after a
>> few googles.
>
> ChunkedArrays aren't a first-class citizen in the Arrow file format
> specification. Working through Tables that get converted to RecordBatches
> when saving is all that is supported.
>
>> 2) Is it unreasonable to think this will use less memory than np.memmap?
>
> I'm not familiar with np.memmap, so I can't really say.
>
> [1] https://arrow.apache.org/docs/python/generated/pyarrow
>
> On Wed, Feb 17, 2021 at 7:11 PM Sam Shleifer <[email protected]> wrote:
>
>> *My goal*
>> I have a list of numpy arrays of uneven length. From the docs, I guess
>> the right format for this is ChunkedArray.
>> I want to save my list to disk in one process, and then start many new
>> processes (a PyTorch dataloader) that are able to read chunks from the file
>> with low memory overhead.
>> The current solution is to flatten the array, keep a list of the
>> lengths/offsets, store the flattened array in `np.memmap`, then have each
>> process slice into the memmap at the right index.
>> It seems that with Arrow, we can at least delete the list of
>> lengths/offsets.
>>
>> *What I have tried:*
>> Padding each entry in the list to a fixed length, and saving the pa.Table to a
>> pa.NativeFile. Each process reads its own pa.Table. This is slower and
>> less memory efficient than `memmap` by about 15%.
>>
>> *My questions:*
>> 1) Are there any examples online that do this sort of operation? I can't
>> find how to save a chunked array to disk, or a Python Flight example, after a
>> few googles.
>> 2) Is it unreasonable to think this will use less memory than np.memmap?
>>
>> Thanks in advance!
>> Sam
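Following up on the ListArray suggestion quoted above, here is a minimal end-to-end sketch under the same assumptions (float32 data; the file name "ragged.arrow", the column name "sequences", and the lengths are all hypothetical):

```python
import numpy as np
import pyarrow as pa

# Ragged input: a list of numpy arrays of uneven length (lengths made up).
chunks = [np.random.rand(n).astype(np.float32) for n in (3, 7, 2)]

# Flatten once; the per-array offsets then live inside the ListArray
# instead of in a separate Python-side list.
offsets = np.concatenate([[0], np.cumsum([len(c) for c in chunks])]).astype(np.int32)
values = pa.array(np.concatenate(chunks))
list_arr = pa.ListArray.from_arrays(pa.array(offsets), values)

# Save as a one-column Table in the Arrow IPC file format.
table = pa.Table.from_arrays([list_arr], names=["sequences"])
with pa.OSFile("ragged.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# Each reader process memory-maps the file and slices out the rows it
# needs; .values on the list scalar is a view over the flat buffer.
source = pa.memory_map("ragged.arrow", "r")
col = pa.ipc.open_file(source).read_all().column("sequences")
second = col[1].values.to_numpy()  # the second original array, no copy
```

This keeps the offsets inside the Arrow format itself, so the side list of lengths/offsets from the np.memmap solution goes away, while each reader process still only touches the bytes for the rows it slices.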
