Hi everyone,

For a project I'm working on I've picked Arrow as the library, with either Feather or Parquet as our storage format for the tabular data. However, I also have some hyperspectral data to serialize, and I'd prefer not to add another big dependency if I can avoid it, so I've been trying to make something in Arrow work for my application. Typically our hyperspectral data is [N, 4096]-shaped, where N is in the tens of millions.
Initially I looked at `arrow.Tensor` via the IPC module, but it seems a bit limited. You can memory-map it, but only if it's uncompressed; if you compress it, you have no means to decompress individual chunks, from what I can tell from prototyping within Python. You also cannot attach metadata to it. (A rough sketch of what I tried is at the end of this mail.)

I do have an associated Table, as each spectrum has metadata, but if I split the spectra up to one per row I end up with tens of millions of individual `numpy.ndarray` objects, which causes a lot of performance issues. The data is contiguous, but I would have to write some C extension to slice and view it (and managing the reference counting would be a pain), and there's still no means to partially load the data.

I could create a Table with one column per chunk and one cell per column, but this is clunky. I also took a look at breaking the array up into a list of RecordBatch, but `RecordBatchStreamReader` doesn't seem to allow you to read only selected indices, so no real chunking support there either (also sketched below).

Or is there some other lightweight (not HDF5), cloud-friendly solution that I should be looking at?

Sincerely,
Robert

--
Robert McLeod
[email protected]
[email protected]
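For concreteness, here is roughly the `arrow.Tensor` IPC path I prototyped. This is just a sketch: the file name and array sizes are placeholders and the data is random, not our real spectra.

    import numpy as np
    import pyarrow as pa

    # Stand-in for the real [N, 4096] block; N is much larger in practice.
    spectra = np.random.rand(10_000, 4096).astype(np.float32)
    tensor = pa.Tensor.from_numpy(spectra)

    with pa.OSFile('spectra.arrow', 'wb') as sink:
        pa.ipc.write_tensor(tensor, sink)

    # Zero-copy read back through a memory map. This works, but only because
    # the payload is uncompressed, and there is nowhere to hang metadata.
    with pa.memory_map('spectra.arrow', 'r') as source:
        loaded = pa.ipc.read_tensor(source).to_numpy()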
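And the one-spectrum-per-row RecordBatch stream variant, mainly to show that the read side is sequential-only. Again just a sketch with placeholder sizes; I'm using a fixed-size-list column here as one way of putting a 4096-channel spectrum in each row.

    import numpy as np
    import pyarrow as pa

    N_CHANNELS = 4096

    def make_batch(n_rows=1000):
        # One spectrum per row, packed as a fixed-size-list cell.
        flat = pa.array(np.random.rand(n_rows * N_CHANNELS).astype(np.float32))
        col = pa.FixedSizeListArray.from_arrays(flat, N_CHANNELS)
        return pa.record_batch([col], names=['spectrum'])

    first = make_batch()
    with pa.OSFile('spectra_stream.arrow', 'wb') as sink:
        with pa.ipc.new_stream(sink, first.schema) as writer:
            writer.write_batch(first)
            for _ in range(9):
                writer.write_batch(make_batch())

    # Reading back: the stream reader only walks batches front to back,
    # so getting at batch k means churning through every batch before it.
    with pa.memory_map('spectra_stream.arrow', 'r') as source:
        for batch in pa.ipc.open_stream(source):
            pass  # no way to ask for only selected batches / row indices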
