Micah,

I checked and you are correct: the VectorLoader does not copy anything, so
as long as you can create an ArrowBuf you can initialize a batch of
vectors with that ArrowBuf.  I had thought the VectorLoader made another
copy itself.

The allocators on vectors don't pose a meaningful issue; they just
seem like mild overengineering to me.  The state machine incorporated into
the vectors definitely caused a few WTF moments.

The pathway from an arrow vector into a tech reader goes through the actual
ArrowBuf, which creates a native buffer, which is the same thing the mmap
pathway operates on.  It is probably tough to follow even for Clojure
people because of the compile-time programming needed to support many arrow
vector types, but the outline is:

For each datatype supported, implement a conversion from the vector
datatype to a tech reader
<https://github.com/techascent/tech.ml.dataset/blob/ffbf40b6f5e3e4c916bb905c28dccaaef5d9e4cc/src/tech/libs/arrow/copying.clj#L378>
via its underlying arrow buffer
<https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/libs/arrow/copying.clj#L378>
which converts to the same native buffer struct the mmap pathway uses
<https://github.com/techascent/tech.datatype/blob/b621dbe8ad94d42e4bd0db261e75fb1c8e03ace1/src/tech/v2/datatype/mmap.clj#L136>.

This is done with a technique called protocols
<https://clojure.org/reference/protocols> which is a language feature of
Clojure that allows you to map interfaces to a type after the fact
precisely for situations like this where I want to bind the arrow vectors
to the TechAscent numerics system.  Protocols can cause a noticeable
performance penalty so I use them once to change into a different efficient
representation but not for per-element access.
<https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/libs/arrow/copying.clj#L378>
So it is unlikely that anything in the performance tuning guide is going to
make a difference; I don't use the arrow vector accessors in the first
place but rather a one-off conversion of the vector into its data memory
address and then I use the memory address directly.  I did check using the
getSafe accessors, however, and they added a small extra bit of overhead
but not enough to really make a point about.
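
To make the one-off conversion concrete, here is a rough sketch in plain
Java using only the JDK.  It is not the tech.datatype or Arrow code: the
`CheckedVector` class is a hypothetical stand-in for an Arrow vector with a
bounds-checked getter, `toReader` is a hypothetical name for the single
conversion step, and a `DoubleBuffer` view stands in for reading through a
raw memory address.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.DoubleBuffer;

final class OneOffConversion {
    /** Hypothetical stand-in for an Arrow vector: owns off-heap memory, exposes a checked getter. */
    static final class CheckedVector {
        private final ByteBuffer data;
        private final int count;

        CheckedVector(double[] values) {
            count = values.length;
            data = ByteBuffer.allocateDirect(count * Double.BYTES).order(ByteOrder.nativeOrder());
            for (int i = 0; i < count; i++) {
                data.putDouble(i * Double.BYTES, values[i]); // absolute put, position stays at 0
            }
        }

        /** Per-element accessor with an explicit range check on every call. */
        double getChecked(int index) {
            if (index < 0 || index >= count) {
                throw new IndexOutOfBoundsException("index " + index + ", size " + count);
            }
            return data.getDouble(index * Double.BYTES);
        }

        int size() {
            return count;
        }

        /** Read-only view of the same memory; no bytes are copied. */
        ByteBuffer dataBuffer() {
            // Views default to big-endian, so the byte order is set again.
            return data.asReadOnlyBuffer().order(ByteOrder.nativeOrder());
        }
    }

    /** Minimal reader interface, loosely analogous to the "tech reader" in the email. */
    interface DoubleReader {
        double read(int index);
        int size();
    }

    /** The one-off conversion: done once per vector, never once per element. */
    static DoubleReader toReader(CheckedVector vector) {
        final DoubleBuffer doubles = vector.dataBuffer().asDoubleBuffer();
        final int n = vector.size();
        return new DoubleReader() {
            @Override public double read(int index) { return doubles.get(index); }
            @Override public int size() { return n; }
        };
    }

    /** Elementwise work goes through the reader only; the vector is no longer involved. */
    static double sum(DoubleReader reader) {
        double total = 0.0;
        for (int i = 0; i < reader.size(); i++) {
            total += reader.read(i);
        }
        return total;
    }

    public static void main(String[] args) {
        CheckedVector vector = new CheckedVector(new double[] {1.5, 2.5, 4.0});
        DoubleReader reader = toReader(vector);
        System.out.println("sum via reader: " + sum(reader));
    }
}
```

Once the reader exists, the loop in `sum` no longer touches the vector's
accessors at all, which is why accessor-level tuning has little effect on
this style of access.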

This means the mmap pathway and the copying pathway boil down to the
exact same code for elementwise access; the timing cannot differ between
them.  I was interested in file loading time in the specific case where you
want only 1 column out of many, not in getSafe/etc. timings, which can be
avoided in multiple ways.
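
The single-column case can be sketched with the JDK's memory-mapping API.
This is only an illustration under simplifying assumptions: the file layout
here is a toy one (equal-length columns of little-endian doubles written
back to back, not the Arrow IPC format), and `writeColumns` and `mapColumn`
are hypothetical helper names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.DoubleBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class SingleColumnMmap {
    /** Writes the columns back to back as raw little-endian doubles (toy layout). */
    static Path writeColumns(double[][] columns) {
        try {
            Path file = Files.createTempFile("columns", ".bin");
            file.toFile().deleteOnExit();
            try (FileChannel channel = FileChannel.open(file, StandardOpenOption.WRITE)) {
                for (double[] column : columns) {
                    ByteBuffer buffer = ByteBuffer.allocate(column.length * Double.BYTES)
                            .order(ByteOrder.LITTLE_ENDIAN);
                    for (double value : column) {
                        buffer.putDouble(value);
                    }
                    buffer.flip();
                    while (buffer.hasRemaining()) {
                        channel.write(buffer);
                    }
                }
            }
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Maps only the byte range of one column; the other columns are never read. */
    static DoubleBuffer mapColumn(Path file, int columnIndex, int rows) {
        long bytesPerColumn = (long) rows * Double.BYTES;
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer mapped = channel.map(
                    FileChannel.MapMode.READ_ONLY, columnIndex * bytesPerColumn, bytesPerColumn);
            mapped.order(ByteOrder.LITTLE_ENDIAN);
            // The mapping stays valid after the channel is closed.
            return mapped.asDoubleBuffer();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        double[][] columns = {
            {1.0, 2.0, 3.0},
            {10.0, 20.0, 30.0},
            {100.0, 200.0, 300.0},
        };
        Path file = writeColumns(columns);
        DoubleBuffer second = mapColumn(file, 1, 3);
        double total = 0.0;
        for (int i = 0; i < second.remaining(); i++) {
            total += second.get(i);
        }
        System.out.println("sum of column 1: " + total);
    }
}
```

A copying loader would read every column into memory before any of them can
be used, whereas the mapping above only touches the pages that back the
requested column.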

Thanks for both of your responses :-).

Chris

On Sat, Aug 15, 2020 at 9:26 PM Micah Kornfield <emkornfi...@gmail.com>
wrote:

> Hi Chris,
>
>> The deserialization system should not assume a copy is necessary
>>> <https://github.com/apache/arrow/blob/ecba35cac76185f59c55048b82e895e5f8300140/java/vector/src/main/java/org/apache/arrow/vector/ipc/message/MessageSerializer.java#L381>.
>>>
>>>
>> This is one of many ways to reconstruct an arrow record batch. We
>> frequently reconstruct without any copies. It'd be great if you looked to
>> contribute some of the improvements you believe are needed back to the
>> project.
>>
>
> +1.  If I didn't say this on the previous thread. IIRC, there is nothing
> about the VectorLoader [1] that assumes copies, this just needs to be
> pushed further down the stack.
>
> My opinion is that a better design for the Arrow JVM bindings would be to
>> have each record batch be potentially allocated but remove allocators from
>> the vectors themselves.
>
>
> Could you expand on this?  What problems do allocators on Vectors present?
>
> Lastly, if you are running benchmarks, please checkout performance tuning
> section of the README [2] which includes environment variables that would
> be set under production scenarios (I had a little trouble following the
> clojure call but it does look like it is calling "get" on the
> Float8Vector?).
>
> -Micah
>
> [1]
> https://github.com/apache/arrow/blob/ecba35cac76185f59c55048b82e895e5f8300140/java/vector/src/main/java/org/apache/arrow/vector/VectorLoader.java
> [2] https://github.com/apache/arrow/tree/master/java#performance-tuning
