hi Bipin,

On Fri, Oct 12, 2018 at 12:09 AM Bipin Mathew <[email protected]> wrote:
>
> Good Evening Everyone,
>
> Circling back to this ask. I just wanted to suggest that, instead of an
> email thread, it may be more valuable for some of the noobs out there
> like me to have an "Apache Arrow and Shared Memory" "Hello World"
> example program available on the developer wiki, possibly without the
> additional complication of Plasma.
Given the current project development and maintenance workload, I doubt
that anyone can change their priorities right now to provide more user
documentation. If you would like to contribute some examples, that would
be very helpful.

Relatedly: I'm actively seeking sponsorship / donations for my
not-for-profit development group Ursa Labs, which works solely on Apache
Arrow. The more sponsorship we have, the more "free" help we can
provide.

> I managed to get many ancillary features of Apache Arrow working (IPC,
> for example), but have not quite closed the circle on the raison
> d'etre for Apache Arrow, which is efficiently sharing tables and
> record batches in shared memory. It is not even obvious to me whether
> it is possible to construct the tables in shared memory, or whether
> they have to be copied there after being constructed elsewhere.

If the data is located in shared memory, and you have a metadata
descriptor (as defined in Message/Schema.fbs) describing the locations
of the component memory buffers, then you can use
arrow::ipc::ReadRecordBatch to do a zero-copy "read" into an
arrow::RecordBatch.

How the data gets to shared memory (whether materialized in RAM and
then copied, or materialized directly there) and how the metadata
descriptor is constructed may vary a lot. We have tried to make the
simple case easy, where you write from RAM and then do a zero-copy
read. Beyond that (i.e. if you want to avoid allocating any RAM), I
think you're going to need to dig into the details of the IPC protocol.
The buffers constituting a RecordBatch don't need to be contiguous, for
example.

> I also happened to come across this, currently unanswered, question on
> Stack Overflow, which references an approach I was thinking about
> (basically, create a shared memory subclass of MemoryPool), but I was
> not sure that was the appropriate level of the stack at which to
> attack this problem.
>
> https://stackoverflow.com/questions/52673910/allocate-apache-arrow-memory-pool-in-external-memory

This could be possible -- the implementation could end up being rather
complex, though (e.g. I could see the implementation of "Reallocate"
being tricky).

The most reliable way to materialize directly into shared memory is to
determine the buffer sizes ahead of time, create a large enough shared
memory page, write the data into it (while building your own metadata
descriptor -- you would probably need to use Flatbuffers directly), and
then put the metadata descriptor somewhere.

I'm not sure what else the Arrow project could provide to make this
process easier (one thing: we could provide an API for building your
own RecordBatch descriptors without having to use Flatbuffers directly).

Thanks
Wes

> Another approach I was considering is subclassing from
> ResizableBuffer, but I was not sure if that is the right method
> either, since I was not sure whether I could construct tables in
> shared memory without copying.
>
> Thank you to this great community for all your help in this matter. I
> am very excited about this project and its prospects.
>
> Regards,
>
> Bipin
>
> On Wed, Oct 3, 2018 at 4:37 PM Bipin Mathew <[email protected]> wrote:
>>
>> Totally understandable. Thank you Wes! We can continue this
>> correspondence there. Looking forward to the 0.11 release :-)
>>
>> Regards,
>>
>> Bipin
>>
>> On Wed, Oct 3, 2018 at 4:22 PM Wes McKinney <[email protected]> wrote:
>>>
>>> hi Bipin -- I will reply to your mail on the dev@ mailing list, but
>>> it may take me some time. I'm traveling internationally to
>>> conferences and have also been focused on moving the 0.11 release
>>> forward.
>>>
>>> - Wes
>>>
>>> On Wed, Oct 3, 2018 at 12:00 PM Bipin Mathew <[email protected]> wrote:
>>> >
>>> > Good Morning Everyone,
>>> >
>>> > I originally posted this question to the dev channel, not knowing
>>> > a user channel was available.
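[Editor's note: a minimal sketch of the "size it first" approach
described above, assuming the 0.x Status-returning C++ API that this
thread uses. The function name WriteBatchInPlace is hypothetical, and
an ordinary mutable arrow::Buffer stands in here for a real shared
memory page; exact signatures may differ in later Arrow releases.]

```cpp
#include <memory>
#include <arrow/api.h>
#include <arrow/io/memory.h>
#include <arrow/ipc/api.h>

// Hypothetical sketch: compute the serialized size of a record batch,
// check it fits the pre-allocated region, then serialize directly into
// that region through a writer that cannot reallocate.
arrow::Status WriteBatchInPlace(const arrow::RecordBatch& batch,
                                std::shared_ptr<arrow::Buffer> shm_region) {
  // 1. Determine how many bytes the serialized batch will occupy.
  int64_t size = 0;
  ARROW_RETURN_NOT_OK(arrow::ipc::GetRecordBatchSize(batch, &size));
  if (size > shm_region->size()) {
    return arrow::Status::Invalid("shared memory region too small");
  }

  // 2. Wrap the region in a fixed-size writer, so every byte lands
  //    inside the region and no heap buffer is grown behind the scenes.
  arrow::io::FixedSizeBufferWriter writer(shm_region);

  // 3. Serialize the metadata and body directly into the region.
  int32_t metadata_length = 0;
  int64_t body_length = 0;
  ARROW_RETURN_NOT_OK(arrow::ipc::WriteRecordBatch(
      batch, /*buffer_start_offset=*/0, &writer, &metadata_length,
      &body_length, arrow::default_memory_pool()));
  return writer.Close();
}
```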
>>> > This channel is probably more appropriate, and I am hoping the
>>> > kind souls here can help me. How, fundamentally, are we expected
>>> > to copy, or indeed directly write, an Arrow table to shared
>>> > memory using the C++ SDK? Currently, I have an implementation
>>> > like this:
>>> >
>>> >> 77   std::shared_ptr<arrow::Buffer> B;
>>> >> 78   std::shared_ptr<arrow::io::BufferOutputStream> buffer;
>>> >> 79   std::shared_ptr<arrow::ipc::RecordBatchWriter> writer;
>>> >> 80   arrow::MemoryPool* pool = arrow::default_memory_pool();
>>> >> 81   arrow::io::BufferOutputStream::Create(4096, pool, &buffer);
>>> >> 82   std::shared_ptr<arrow::Table> table;
>>> >> 83   karrow::ArrowHandle *h;
>>> >> 84   h = (karrow::ArrowHandle *)Kj(khandle);
>>> >> 85   table = h->table;
>>> >> 86
>>> >> 87   arrow::ipc::RecordBatchStreamWriter::Open(buffer.get(), table->schema(), &writer);
>>> >> 88   writer->WriteTable(*table);
>>> >> 89   writer->Close();
>>> >> 90   buffer->Finish(&B);
>>> >> 91
>>> >> 92   // printf("Investigate Memory usage.");
>>> >> 93   // getchar();
>>> >> 94
>>> >> 95
>>> >> 96   std::shared_ptr<arrow::io::MemoryMappedFile> mm;
>>> >> 97   arrow::io::MemoryMappedFile::Create("/dev/shm/arrow_table", B->size(), &mm);
>>> >> 98   mm->Write(B->data(), B->size());
>>> >> 99   mm->Close();
>>> >
>>> > "table" on line 85 is a shared_ptr to an arrow::Table object. As
>>> > you can see there, I write to an arrow::Buffer and then write
>>> > that to a memory-mapped file. Is there a more direct approach? I
>>> > watched this video of a talk @Wes McKinney gave here:
>>> >
>>> > https://www.dremio.com/webinars/arrow-c++-roadmap-and-pandas2/
>>> >
>>> > where a method, arrow::MemoryMappedBuffer, was referenced, but I
>>> > have not seen any documentation regarding this function. Has it
>>> > been deprecated?
>>> >
>>> > Also, as I mentioned, "table" up there is an arrow::Table object.
>>> > I create it columnwise using the various arrow::[type]Builder
>>> > functions.
>>> > Is there any way to actually write the original table directly
>>> > into shared memory? Any guidance on the proper way to do these
>>> > things would be greatly appreciated.
>>> >
>>> > Regards,
>>> >
>>> > Bipin
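
[Editor's note: the write-then-zero-copy-read round trip discussed in
this thread can be sketched end to end as follows. This is a hedged
illustration, not a definitive implementation: it assumes the 0.x
Status-returning C++ API used elsewhere in the thread, the function
name RoundTrip is hypothetical, and the 4 MB size cap is an arbitrary
choice for the example.]

```cpp
#include <memory>
#include <arrow/api.h>
#include <arrow/io/file.h>
#include <arrow/ipc/api.h>

// Write a table's IPC stream straight into a memory-mapped file under
// /dev/shm, then map the same file and "read" the batch back without
// copying its buffers.
arrow::Status RoundTrip(const std::shared_ptr<arrow::Table>& table) {
  // --- Write side: stream directly into the mapped region, skipping
  // the intermediate BufferOutputStream used in the snippet above.
  std::shared_ptr<arrow::io::MemoryMappedFile> mm;
  ARROW_RETURN_NOT_OK(arrow::io::MemoryMappedFile::Create(
      "/dev/shm/arrow_table", 4 * 1024 * 1024, &mm));

  std::shared_ptr<arrow::ipc::RecordBatchWriter> writer;
  ARROW_RETURN_NOT_OK(arrow::ipc::RecordBatchStreamWriter::Open(
      mm.get(), table->schema(), &writer));
  ARROW_RETURN_NOT_OK(writer->WriteTable(*table));
  ARROW_RETURN_NOT_OK(writer->Close());
  ARROW_RETURN_NOT_OK(mm->Close());

  // --- Read side: map the file read-only; record batches produced by
  // the stream reader reference the mapped memory (zero copy).
  std::shared_ptr<arrow::io::MemoryMappedFile> ro;
  ARROW_RETURN_NOT_OK(arrow::io::MemoryMappedFile::Open(
      "/dev/shm/arrow_table", arrow::io::FileMode::READ, &ro));

  std::shared_ptr<arrow::RecordBatchReader> reader;
  ARROW_RETURN_NOT_OK(
      arrow::ipc::RecordBatchStreamReader::Open(ro.get(), &reader));

  std::shared_ptr<arrow::RecordBatch> batch;
  ARROW_RETURN_NOT_OK(reader->ReadNext(&batch));
  // batch's buffers now point into the shared memory mapping.
  return arrow::Status::OK();
}
```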
