Micah, I checked and you are correct: the VectorLoader does not copy anything, so as long as you can create an ArrowBuf you can initialize a batch of vectors with it. I had thought the VectorLoader did another copy itself.

The allocators on vectors don't pose a meaningful issue; they just seem like mild overengineering to me. The state machine incorporated into the vectors definitely caused a few WTF moments.

The pathway from an Arrow vector into a tech reader goes through the actual ArrowBuf, which creates a native buffer, the same thing the mmap pathway operates on. It is probably tough to follow even for Clojure people, because of the compile-time programming needed to support many Arrow vector types, but the outline is: for each datatype supported, implement a conversion from the vector datatype to a tech reader <https://github.com/techascent/tech.ml.dataset/blob/ffbf40b6f5e3e4c916bb905c28dccaaef5d9e4cc/src/tech/libs/arrow/copying.clj#L378> via its underlying arrow buffer <https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/libs/arrow/copying.clj#L378>, which converts to the same native buffer struct the mmap pathway uses <https://github.com/techascent/tech.datatype/blob/b621dbe8ad94d42e4bd0db261e75fb1c8e03ace1/src/tech/v2/datatype/mmap.clj#L136>.

This is done with protocols <https://clojure.org/reference/protocols>, a Clojure language feature that lets you attach interfaces to a type after the fact, precisely for situations like this one where I want to bind the Arrow vectors to the TechAscent numerics system. Protocols can carry a noticeable performance penalty, so I use them once to convert into a different, efficient representation but not for per-element access. <https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/libs/arrow/copying.clj#L378>

So it is unlikely that anything in the performance tuning guide is going to make a difference: I don't use the Arrow vector accessors in the first place, but rather a one-off conversion of the vector into its data memory address, and then I use the memory address directly.
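
For anyone who finds the Clojure hard to follow, here is a rough Java sketch of the same "convert once, then read raw memory" idea using only java.nio. It is not the tech.datatype or Arrow code: NativeDoubleReader, fromCopy, fromMmap and writeSample are hypothetical names, and the file is assumed to hold raw little-endian doubles rather than a real Arrow file. The point is only that both pathways end in one shared representation, so the per-element code is identical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical illustration; not part of Arrow or tech.datatype. */
final class NativeDoubleReader {
    // The single representation both pathways convert into.
    private final ByteBuffer buf;

    private NativeDoubleReader(ByteBuffer buf) {
        this.buf = buf.order(ByteOrder.LITTLE_ENDIAN);
    }

    /** Copying pathway: read the file and copy it into off-heap memory. */
    static NativeDoubleReader fromCopy(Path file) {
        try {
            byte[] bytes = Files.readAllBytes(file);
            ByteBuffer direct = ByteBuffer.allocateDirect(bytes.length);
            direct.put(bytes);
            direct.flip(); // reset position to 0, limit to the bytes written
            return new NativeDoubleReader(direct);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** mmap pathway: map the file; no bytes are copied up front. */
    static NativeDoubleReader fromMmap(Path file) {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping stays valid after the channel is closed.
            return new NativeDoubleReader(
                ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size()));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Number of doubles in the buffer. */
    int size() {
        return buf.limit() / Double.BYTES;
    }

    /** Elementwise access: the same code regardless of how buf was made. */
    double read(int idx) {
        return buf.getDouble(idx * Double.BYTES);
    }

    double sum() {
        double total = 0.0;
        for (int i = 0; i < size(); i++) {
            total += read(i);
        }
        return total;
    }

    /** Helper for the demo: write doubles to a temp file, little-endian. */
    static Path writeSample(double[] values) {
        ByteBuffer out = ByteBuffer.allocate(values.length * Double.BYTES)
                                   .order(ByteOrder.LITTLE_ENDIAN);
        for (double v : values) {
            out.putDouble(v);
        }
        try {
            Path file = Files.createTempFile("column", ".bin");
            Files.write(file, out.array());
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path file = writeSample(new double[] {1.5, 2.5, 4.0});
        System.out.println(fromCopy(file).sum());
        System.out.println(fromMmap(file).sum());
    }
}
```

Both calls in main go through the same read method, so any timing difference between the two pathways in a setup like this would have to come from loading, not from element access.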

I did check the getSafe accessors, however: they add a small amount of overhead, but not enough to really make a point about. Because of that one-off conversion, the mmap pathway and the copying pathway boil down to the exact same code for elementwise access; the timing cannot differ between them. What I was interested in was file loading time in the specific case where you want only 1 column out of many, not getSafe/etc. timings, which can be avoided in multiple ways.

Thanks for both of your responses :-).

Chris

On Sat, Aug 15, 2020 at 9:26 PM Micah Kornfield <emkornfi...@gmail.com> wrote:

> Hi Chris,
>
>> The deserialization system should not assume a copy is necessary
>>> <https://github.com/apache/arrow/blob/ecba35cac76185f59c55048b82e895e5f8300140/java/vector/src/main/java/org/apache/arrow/vector/ipc/message/MessageSerializer.java#L381>.
>>>
>> This is one of many ways to reconstruct an arrow record batch. We
>> frequently reconstruct without any copies. It'd be great if you looked to
>> contribute some of the improvements you believe are needed back to the
>> project.
>>
>
> +1. If I didn't say this on the previous thread. IIRC, there is nothing
> about the VectorLoader [1] that assumes copies, this just needs to be
> pushed further down the stack.
>
> My opinion is that a better design for the Arrow JVM bindings would be to
>> have each record batch be potentially allocated but remove allocators from
>> the vectors themselves.
>
> Could you expand on this? What problems to allocators on Vectors present?
>
> Lastly, if you are running benchmarks, please checkout performance tuning
> section of the README [2] which includes environment variables that would
> be set under production scenarios (I had a little trouble following the
> clojure call but it does look like it is calling "get" on the
> Float8Vector?).
>
> -Micah
>
> [1]
> https://github.com/apache/arrow/blob/ecba35cac76185f59c55048b82e895e5f8300140/java/vector/src/main/java/org/apache/arrow/vector/VectorLoader.java
> [2] https://github.com/apache/arrow/tree/master/java#performance-tuning