I think there is some misunderstanding of what I am actually talking about.

If I memory-map a 10 GB file and randomly address within that file, the OS
takes care of paging data into and out of the process. While this memory
does show up in some per-process metrics, it doesn't go through the malloc
or new operators, and depending on how the file is mapped I can share those
pages with other processes.
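
For concreteness, here is a minimal Java sketch of that pattern. The path is
hypothetical, and note that a single MappedByteBuffer is capped at 2 GB, so a
10 GB file would need several mapped regions; one region is enough to show
the idea:

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapExample {
    public static void main(String[] args) throws IOException {
        Path path = Path.of("/data/big-file.bin"); // hypothetical large file
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
            long length = Math.min(channel.size(), 1L << 30); // first 1 GB
            MappedByteBuffer region =
                channel.map(FileChannel.MapMode.READ_ONLY, 0, length);
            // Random access: the OS pages data in on demand. None of this
            // touches the Java heap or goes through malloc/new, and with a
            // read-only mapping the pages are shareable across processes.
            byte b = region.get((int) (length / 2));
            System.out.println("byte at midpoint: " + b);
        }
    }
}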

What I want to do is load a very large Arrow file 'in-place' and let the OS
take care of paging data into and out of the process. This would produce
less heap churn than the current approach, which memory maps the file (as
most C-based file loading mechanisms do under the covers) and then copies
sections of that file into a separate memory region; that copy is what the
current Arrow loading mechanism does.
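
To make the contrast concrete, this is roughly what the copying path looks
like through the current Java API (file path hypothetical; the comment marks
the copy that in-place loading would avoid):

import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowFileReader;

public class CopyingLoad {
    public static void main(String[] args) throws Exception {
        Path path = Path.of("/data/big.arrow"); // hypothetical Arrow file
        try (BufferAllocator allocator = new RootAllocator();
             FileChannel channel = FileChannel.open(path, StandardOpenOption.READ);
             ArrowFileReader reader = new ArrowFileReader(channel, allocator)) {
            while (reader.loadNextBatch()) {
                // loadNextBatch() reads the batch's buffers from the channel
                // and copies them into freshly allocated native memory owned
                // by the allocator -- exactly the copy in-place loading skips.
                VectorSchemaRoot root = reader.getVectorSchemaRoot();
                System.out.println("rows in batch: " + root.getRowCount());
            }
        }
    }
}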

In my experience, loading in-place is generally quite a bit faster than the
current design, and it also allows random access to out-of-memory datasets.
So it is both a speed and a flexibility win.

So no, in my experience manually managed memory is not faster, and it
usually creates a larger overall memory footprint, depending on various OS
settings and general load. Arrow's binary format is what makes in-place
loading possible, and my question really was: is anyone else working with
Arrow via Java (like the Flink team) interested in developing this pathway?
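
As a rough illustration of the kind of building block that pathway could
start from, here is a sketch that wraps a mapped region in an ArrowBuf
without copying. This assumes the public ArrowBuf constructor,
ReferenceManager.NO_OP, and MemoryUtil.getByteBufferAddress from
arrow-memory-core; the file path is hypothetical, and the footer/metadata
parsing is left out:

import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import org.apache.arrow.memory.ArrowBuf;
import org.apache.arrow.memory.ReferenceManager;
import org.apache.arrow.memory.util.MemoryUtil;

public class InPlaceSketch {
    public static void main(String[] args) throws Exception {
        Path path = Path.of("/data/big.arrow"); // hypothetical Arrow file
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
            long length = Math.min(channel.size(), Integer.MAX_VALUE);
            MappedByteBuffer mapped =
                channel.map(FileChannel.MapMode.READ_ONLY, 0, length);
            // Zero-copy: point an ArrowBuf at the mapped pages. NO_OP means
            // no allocator accounting; the buffer must not outlive the
            // mapping.
            ArrowBuf buf = new ArrowBuf(
                ReferenceManager.NO_OP, null, length,
                MemoryUtil.getByteBufferAddress(mapped));
            // A real implementation would parse the file footer and slice
            // per-batch buffers out of `buf` rather than copying them.
            System.out.println("zero-copy capacity: " + buf.capacity());
        }
    }
}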

On Sun, Jul 26, 2020 at 10:33 AM Jacques Nadeau <[email protected]> wrote:

>
>
> On Sun, Jul 26, 2020 at 5:52 AM Chris Nuernberger <[email protected]>
> wrote:
>
>> Hmm, sounds reasonable enough.  I may be mistaken, but it appears to me
>> that the current code's reliance on mutably updating the vector schema
>> root precludes concurrent or parallelized access to multiple record
>> batches.  Potentially, a map-batch method that returns a new vector
>> schema root each time would work.
>>
>
> Yeah, you could do something like that. The issue you can see, depending
> on your vector/batch sizes, is increased heap usage. The stream-based
> design of the current classes was built to minimize heap churn when
> working with large pipelines.
