Igor,

I am not an Arrow developer, but to my knowledge the only Java pathway that can use mmap is the one I wrote for Clojure:
https://techascent.com/blog/memory-mapping-arrow.html

The underlying library is tech.ml.dataset <https://github.com/techascent/tech.ml.dataset>, and we also have generic Python bindings <https://github.com/clj-python/libpython-clj>.

I do wonder what the pointer actually points at with pyarrow. Columns themselves may point to up to 3 buffers (data, valid, offsets) in the case of text, and usually have 2 buffers, one for data and one for valid. Potentially the pointer you get back is a pointer to the low-level record batch, but this specifically cannot have a pointer to a dictionary. Just considering the actual Arrow file format, a single pointer cannot point to both the schema information (which contains the dictionary) and the record batch column data.

There isn't a single-column interchange format I am aware of, aside from potentially writing a streaming format with a single column.

On Wed, Dec 30, 2020 at 8:08 AM Igor <[email protected]> wrote:

> Hello Apache Arrow developers!
>
> We are using apache arrow library in java and python, using arrow-vector
> arrow-memory-unsafe in java and Pyarrow in python.
>
> We try to implement in memory zero copy DataFrame, but we can’t find
> appropriate API in java libraries to get memory address of our vectors from
> python. I have found that API in Pyarrow library, but not in java libraries.
>
> What we need:
> 1) Create vector in java, collect data in memory using arrow as memory map
> API
> 2) Get memory address or descriptor in java
> 3) Pass it to the python library Pyarrow
> 4) Read vector data
>
> We have problem in the point 2
>
> Tell us please, how we can do that. Thank you!
>
>
> Best regards,
> Eshtyganov Igor
> https://www.upgini.com
>
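
P.S. To make the point about a text column needing up to 3 buffers concrete, here is a rough standard-library-only Python sketch of how such a column splits into valid, offsets, and data buffers. This is only an illustration of the layout as I understand it from the Arrow columnar format: the function names encode_string_column and decode_string_column are made up for this email and are not pyarrow or arrow-vector API, and the sketch skips the padding and alignment that real Arrow buffers carry.

```python
import struct


def encode_string_column(values):
    """Split a list of str/None into (valid, offsets, data) buffers.

    Hypothetical helper for illustration only; not part of any Arrow library.
    """
    valid = bytearray((len(values) + 7) // 8)  # 1 bit per value, least significant bit first
    offsets = [0]                              # n + 1 little-endian int32 offsets
    data = bytearray()                         # concatenated UTF-8 bytes
    for i, value in enumerate(values):
        if value is not None:
            valid[i // 8] |= 1 << (i % 8)
            data += value.encode("utf-8")
        # A null repeats the previous offset, so it occupies no data bytes.
        offsets.append(len(data))
    packed_offsets = struct.pack(f"<{len(offsets)}i", *offsets)
    return bytes(valid), packed_offsets, bytes(data)


def decode_string_column(valid, offsets, data):
    """Rebuild the Python list from the three buffers."""
    count = len(offsets) // 4 - 1
    offs = struct.unpack(f"<{count + 1}i", offsets)
    result = []
    for i in range(count):
        if (valid[i // 8] >> (i % 8)) & 1:
            result.append(data[offs[i]:offs[i + 1]].decode("utf-8"))
        else:
            result.append(None)
    return result


valid, offsets, data = encode_string_column(["foo", None, "quux"])
print(valid, struct.unpack("<4i", offsets), data)
print(decode_string_column(valid, offsets, data))
```

Reading the column back needs all three buffers plus the row count, which is why I doubt a single raw address is enough to hand a column from Java to Python.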
