The distinction between heap and off-heap is confusing to someone who works in both Java and C++, but I understand what you are saying; there is some minimal overhead there. What I keep trying to say is that when you use malloc (or create a new object in the JVM) you are allocating memory that can't be paged out of the process; it is forced to be 'on heap', and whether that heap is native or Java is independent of that. When you use a memory-mapped file, you can access more 'heap' in the sense that I can let the OS page data in and out of memory, which isn't something you can do otherwise, unless perhaps Netty always uses a file backing store or something like that.
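To make the file-backed case concrete, here is a minimal JDK-only sketch (the class name MmapDemo and the offsets are mine, just for illustration): FileChannel.map gives you pages the kernel faults in lazily and can evict and re-read from the file at will, unlike an anonymous byte[] (or malloc'd) allocation, which is pinned to the process or swap.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapDemo {

    // Map a freshly written 1 MiB file of zeros and read one byte in place.
    // The pages backing 'map' are file-backed: the OS may evict them and
    // re-fault them from disk later, unlike the anonymous 'onHeap' array.
    static byte mapAndRead() throws IOException {
        byte[] onHeap = new byte[1 << 20]; // new/malloc-style: counted against the process, swap-backed
        Path file = Files.createTempFile("mmap-demo", ".bin");
        try {
            Files.write(file, new byte[1 << 20]); // 1 MiB of zeros
            try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
                MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
                return map.get(4096); // page is faulted in lazily by the kernel on this access
            }
        } finally {
            Files.deleteIfExists(file);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("byte at offset 4096 = " + mapAndRead());
    }
}
```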
It isn't like native or 'off-heap' memory is just magic memory that disappears and can be ignored; to the OS there is not much distinction between the JVM heap and the native heap, but mmap memory *is* actually in a different category from both of those. I can identify use cases and algorithms where each is better. I bet that in general you are completely wrong and a solid in-place loading mechanism would win in every case where it is available, but I want to investigate with some concrete cases before going further; the OS is pretty good at caching pages.

What algorithms are you thinking of when you say that an in-place loading mechanism would be at all slower, and when you say that you experience drastically more of one type than the other? I can think of two tests I would run as a perf comparison:

1. Load a file and copy the data into the JVM heap.
2. Load a file and repeatedly reference random elements within the dataset.

Trying 1 and 2 with different-sized files, timed against the input-stream pathway, the file pathway, and a never-before-seen memory-mapped pathway, would I think clarify things. My assertion is that in-place loading would win in all cases against the current design, but of course I have been surprised before.

You are correct that the use case I am thinking of is a large dataset against which I then do needle-in-a-haystack lookups or some more-or-less aggressive filter operations as a first step. That is where my second question about per-batch string tables comes in; I would like to stream data and save it as batches, but to access it randomly each batch needs to be somewhat self-contained. Perhaps I should just save each batch in its own file, but I thought, from my reading of the data spec, that the Arrow binary definition supported per-batch string tables. Honestly that part is a bit unclear to me in the spec: how partial or per-batch string tables are supposed to be handled and stored.
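The two tests above could be sketched roughly like this. This is only a skeleton under my own assumptions (the class and method names are made up, the file here is tiny, and a real comparison would use JMH and files larger than RAM), but it shows the shape of each pathway:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Random;

public class PathwayComparison {

    // Test 1: stream the file and copy every byte into the JVM heap,
    // then touch the heap copy (sum) so the copy cannot be optimized away.
    static long copyIntoHeap(Path file) throws IOException {
        byte[] heapCopy = new byte[(int) Files.size(file)];
        try (InputStream in = Files.newInputStream(file)) {
            int off = 0, n;
            while ((n = in.read(heapCopy, off, heapCopy.length - off)) > 0) {
                off += n;
            }
        }
        long sum = 0;
        for (byte b : heapCopy) sum += b;
        return sum;
    }

    // Test 2: map the file and reference random elements in place;
    // the OS decides which pages stay resident.
    static long randomReadsMapped(Path file, int probes, long seed) throws IOException {
        long sum = 0;
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            Random rnd = new Random(seed);
            for (int i = 0; i < probes; i++) {
                sum += map.get(rnd.nextInt((int) ch.size()));
            }
        }
        return sum;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("pathway", ".bin");
        byte[] data = new byte[1 << 20];
        for (int i = 0; i < data.length; i++) data[i] = (byte) i;
        Files.write(file, data);

        long t0 = System.nanoTime();
        long s1 = copyIntoHeap(file);
        long t1 = System.nanoTime();
        long s2 = randomReadsMapped(file, 100_000, 42);
        long t2 = System.nanoTime();
        System.out.printf("heap copy: %d us, mmap probes: %d us%n",
                (t1 - t0) / 1_000, (t2 - t1) / 1_000);
        Files.deleteIfExists(file);
    }
}
```

Running each case against the input-stream, file, and memory-mapped pathways at several file sizes (especially sizes exceeding physical memory) is the comparison I have in mind.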
On Sun, Jul 26, 2020 at 12:23 PM Jacques Nadeau <[email protected]> wrote:
>
>> If I memory map a 10G file and randomly address within that file the OS
>> takes care of mapping pages into the process and out. This memory, while
>> it does have some metrics against the process doesn't affect the malloc or
>> new operators and depending on how it is mapped I can share those pages
>> with other processes.
>
> When I was talking about heap churn I was talking about the overhead of
> objects referencing the mapped memory, not the memory containing the data
> itself. ArrowBuf, for example, uses somewhere between 100-200 bytes of heap
> memory when using the Netty buffer, independent of the memory it is
> pointing at. This heap memory is used for things like reference counting,
> hierarchical limit tracking, etc. Arrow Java always uses off-heap memory
> for the data so no heap churn happens due to the data.
>
>> So no, in my experience manually managed memory is not faster and it
>> usually creates a larger memory footprint overall dependent upon various OS
>> settings and general load.
>
> :) There are no requirements here to agree with my perspective. I can
> identify use cases and algorithms where each is better. We happen to
> experience drastically more of one type than the other. If your use case is
> persisting an Arrow dataset long-term and then doing known-position
> needle-in-a-haystack or streaming reads from fast storage within it, you'll
> benefit greatly from this pattern.
>
>> Arrow's binary format is what allows in-place loading and my question
>> really was is anyone else working with Arrow via Java (like the Flink team)
>> interested in developing this pathway.
>
> I'm not aware of anyone actively working on that for Java. We'd welcome
> the work.
