Re: Question the nature of the "Zero Copy" advantages of Apache Arrow

Thomas Browne Tue, 26 Jan 2021 10:15:48 -0800

Yes, I think the term "zero copy" was confusing to me. It doesn't quite do what it says on the tin since, if I understand correctly, the term still allows for an actual copy to occur; it's just a direct binary copy without a [de]serialisation process.


I hear you on plasma.

On the issue of MAP_SHARED, got it, but that means I'm having to talk C from other languages.
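
For illustration, a minimal sketch of what a shared mapping provides, written with Python's standard-library mmap module (a thin wrapper over the C mmap call, not an Arrow API). The function name shared_mapping_demo and the temporary file are hypothetical; the point is only that bytes written through one mapping are visible through a second mapping of the same file without an explicit copy between them.

```python
import mmap
import os
import tempfile


def shared_mapping_demo() -> bytes:
    """Write through one mapping and read the same bytes through another."""
    fd, path = tempfile.mkstemp()  # hypothetical scratch file for the demo
    try:
        # A file must be non-empty before it can be mapped.
        os.write(fd, b"\x00" * mmap.PAGESIZE)
        # The default access mode is a shared, writable mapping.
        writer = mmap.mmap(fd, mmap.PAGESIZE)
        # A second, read-only mapping of the same file.
        reader = mmap.mmap(fd, mmap.PAGESIZE, access=mmap.ACCESS_READ)
        try:
            writer[0:5] = b"ticks"
            # Both mappings are backed by the same pages, so the write shows up here.
            return bytes(reader[0:5])
        finally:
            reader.close()
            writer.close()
    finally:
        os.close(fd)
        os.remove(path)
```

Two processes mapping the same file would see the same behaviour; what this sketch leaves out is any agreement on the layout of the bytes, which is the part the Arrow format is meant to standardise.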

I think Jorge's answer (https://arrow.apache.org/docs/format/CDataInterface.html) is pretty good though. Good enough for me anyway. Thanks everyone.


On 26/01/2021 18:09, Daniel Nugent wrote:

I think you might be a bit confused about what zero copy means, if that’s what you’re concerned about. If you have a bigger-than-memory file, then Plasma wasn’t going to help, since its design always involved copying the arrow buffers to memory.
If you have larger-than-memory arrow files in the first place, just open them using mmap (should be automatically done for non-compressed arrow files).
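
A rough sketch of the mmap approach, using only the Python standard library rather than pyarrow (pyarrow exposes the same idea through pyarrow.memory_map). The file of raw doubles and the function name read_price_in_place are made up for illustration and are not the Arrow file format. The value is read straight out of the mapped pages, with no parse step and without loading the whole file.

```python
import mmap
import os
import struct
import tempfile


def read_price_in_place(index: int, count: int = 1000) -> float:
    """Memory-map a file of native doubles and read one element in place."""
    fd, path = tempfile.mkstemp()  # hypothetical stand-in for a large data file
    try:
        # Write `count` doubles (0.0, 1.0, 2.0, ...) in native byte order.
        with os.fdopen(fd, "wb") as f:
            f.write(struct.pack(f"{count}d", *(float(i) for i in range(count))))

        with open(path, "rb") as f:
            # Length 0 maps the whole file; pages are loaded lazily on access.
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
                # Reinterpret the mapped bytes as doubles without copying them.
                with memoryview(mapped) as raw, raw.cast("d") as prices:
                    return prices[index]
    finally:
        os.remove(path)
```

Only the pages actually touched are brought into memory by the operating system, which is why this works for files larger than RAM.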
--
-Dan Nugent
On Jan 26, 2021, 13:07 -0500, Thomas Browne <[email protected]>, wrote:
Don't I lose the benefit of mmapping huge files with a ramdisk? Cos the file has to now fit on my ramdisk.
Personally working with financial tick data which can be enormous.

On 26/01/2021 18:00, Daniel Nugent wrote:
Is there a problem with just using a RAM disk as the method for sharing the arrow buffers? It just seems easier and less finicky than a separate API to program against.
It also makes storing the data permanently a lot more straightforward, I think.
--
-Dan Nugent
On Jan 26, 2021, 12:47 -0500, Thomas Browne <[email protected]>, wrote:
So one of the big advantages of Arrow is the common format in memory, on
the wire, across languages.

I get that this makes it very easy and fast to transfer data between
nodes, and between languages, which will all share the in-memory format
and therefore the (often expensive) serialisation step is removed.

However, is it true that one of the core objectives of the project is
also to allow shared memory objects across different languages on the
same node? For example, a fast C-based ingest system constantly
populates a pyarrow buffer, which can be read directly by any other
application on that node, through pointer sharing?
If this is a core objective, what is the canonical way for brokering the
"pointers" to this data between languages? Is it the Plasma store? And
if so, are there plans for Plasma to be implemented in other client
languages?

In short. Is Plasma (or if not Plasma, the functionality it provides
implemented some other way), a core objective of the project?

Or instead is Flight supposed to be used between languages on the same
node, and if so, does Flight provide true zero-copy (i.e. the same
buffer, not copying the buffer) if run between processes on the same node?
Many thanks.
