Hi everyone,

For a project I'm working on I've picked Arrow as the library, with either Feather or Parquet as our storage format for the tabular data. However, I also have some hyperspectral data to serialize, and I'd prefer not to add another big dependency if I can avoid it, so I've been trying to make something in Arrow work for my application. Typically our hyperspectral data is [N, 4096]-shaped, where N is in the tens of millions.
Initially I looked at `arrow.Tensor` via the IPC module, but it seems a bit limited. You can memory-map it, but only if it's uncompressed; if you compress it, you have no means to decompress individual chunks, from what I can tell from prototyping within Python. You also cannot attach metadata to it. (A rough sketch of what I tried is at the end of this mail.)

I do have an associated Table, as each spectrum has metadata, but if I split the spectra up to one per row I end up with tens of millions of individual `numpy.ndarray` objects, which causes a lot of performance issues. The data is contiguous, but I would have to write some C extension to slice and view it (and managing the reference counting would be a pain), and there's still no means to partially load the data.

I could create a Table with one column per chunk and one cell per column, but this is clunky. I also took a look at breaking the array up into a list of RecordBatch, but `RecordBatchStreamReader` doesn't seem to allow you to read only selected indices, so no real chunking support there either (also sketched below).

Or is there some other lightweight (not HDF5), cloud-friendly solution that I should be looking at?

Sincerely,
Robert

--
Robert McLeod
[email protected]
[email protected]
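For concreteness, here is roughly the `arrow.Tensor` IPC path I prototyped. This is just a sketch: the file name and array sizes are placeholders and the data is random, not our real spectra.

    import numpy as np
    import pyarrow as pa

    # Stand-in for the real [N, 4096] block; N is much larger in practice.
    spectra = np.random.rand(10_000, 4096).astype(np.float32)
    tensor = pa.Tensor.from_numpy(spectra)

    with pa.OSFile('spectra.arrow', 'wb') as sink:
        pa.ipc.write_tensor(tensor, sink)

    # Zero-copy read back through a memory map. This works, but only because
    # the payload is uncompressed, and there is nowhere to hang metadata.
    with pa.memory_map('spectra.arrow', 'r') as source:
        loaded = pa.ipc.read_tensor(source).to_numpy()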
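And the one-spectrum-per-row RecordBatch stream variant, mainly to show that the read side is sequential-only. Again just a sketch with placeholder sizes; I'm using a fixed-size-list column here as one way of putting a 4096-channel spectrum in each row.

    import numpy as np
    import pyarrow as pa

    N_CHANNELS = 4096

    def make_batch(n_rows=1000):
        # One spectrum per row, packed as a fixed-size-list cell.
        flat = pa.array(np.random.rand(n_rows * N_CHANNELS).astype(np.float32))
        col = pa.FixedSizeListArray.from_arrays(flat, N_CHANNELS)
        return pa.record_batch([col], names=['spectrum'])

    first = make_batch()
    with pa.OSFile('spectra_stream.arrow', 'wb') as sink:
        with pa.ipc.new_stream(sink, first.schema) as writer:
            writer.write_batch(first)
            for _ in range(9):
                writer.write_batch(make_batch())

    # Reading back: the stream reader only walks batches front to back,
    # so getting at batch k means churning through every batch before it.
    with pa.memory_map('spectra_stream.arrow', 'r') as source:
        for batch in pa.ipc.open_stream(source):
            pass  # no way to ask for only selected batches / row indices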
