hi Bipin,

On Fri, Oct 12, 2018 at 12:09 AM Bipin Mathew <[email protected]> wrote:
>
> Good Evening Everyone,
>
> Circling back to this ask. I just wanted to suggest that, instead of an
> email thread, it may be more valuable for some of the noobs out there
> like me to have an "Apache Arrow and Shared Memory" "Hello World"
> example program available on the developer wiki, possibly without the
> additional complication of Plasma.
Given the current project development and maintenance workload, I doubt
that anyone can change their priorities right now to provide more user
documentation. If you would like to contribute some examples, that would
be very helpful.

Relatedly: I'm actively seeking sponsorship / donations for my
not-for-profit development group Ursa Labs, which works solely on Apache
Arrow. The more sponsorship we have, the more "free" help we can
provide.

> I managed to get many ancillary features of Apache Arrow working (IPC,
> for example), but have not quite closed the circle on the raison
> d'etre for Apache Arrow, which is efficiently sharing tables and
> record batches in shared memory. It is not even obvious to me whether
> it is possible to construct the tables in shared memory, or whether
> they have to be copied there after being constructed elsewhere.

If the data is located in shared memory, and you have a metadata
descriptor (as defined in Message/Schema.fbs) describing the locations
of the component memory buffers, then you can use
arrow::ipc::ReadRecordBatch to do a zero-copy "read" into an
arrow::RecordBatch.

How the data gets to shared memory (whether materialized in RAM and
then copied, or materialized directly there) and how the metadata
descriptor is constructed may vary a lot. We have tried to make the
simple case easy, where you write from RAM and then do a zero-copy
read. Beyond that (i.e. if you want to avoid allocating any RAM), I
think you're going to need to dig into the details of the IPC protocol.
The buffers constituting a RecordBatch don't need to be contiguous, for
example.

> I also happened to come across this, currently unanswered, question on
> Stack Overflow, which references an approach I was thinking about
> (basically, create a shared memory subclass of MemoryPool), but I was
> not sure that was the appropriate level of the stack at which to
> attack this problem.
>
> https://stackoverflow.com/questions/52673910/allocate-apache-arrow-memory-pool-in-external-memory

This could be possible -- the implementation could end up being rather
complex, though (e.g. I could see the implementation of "Reallocate"
being tricky).

The most reliable way to materialize directly into shared memory is to
determine the buffer sizes ahead of time, create a large enough shared
memory page, write the data into it (while building your own metadata
descriptor -- you would probably need to use Flatbuffers directly), and
then put the metadata descriptor somewhere.

I'm not sure what else the Arrow project could provide to make this
process easier (one thing: we could provide an API for building your
own RecordBatch descriptors without having to use Flatbuffers directly).

Thanks
Wes

> Another approach I was considering is subclassing from
> ResizableBuffer, but I was not sure if that is the right method
> either, since I was not sure whether I could construct tables in
> shared memory without copying.
>
> Thank you to this great community for all your help in this matter. I
> am very excited about this project and its prospects.
>
> Regards,
>
> Bipin
>
> On Wed, Oct 3, 2018 at 4:37 PM Bipin Mathew <[email protected]> wrote:
>>
>> Totally understandable. Thank you Wes! We can continue this
>> correspondence there. Looking forward to the 0.11 release :-)
>>
>> Regards,
>>
>> Bipin
>>
>> On Wed, Oct 3, 2018 at 4:22 PM Wes McKinney <[email protected]> wrote:
>>>
>>> hi Bipin -- I will reply to your mail on the dev@ mailing list, but
>>> it may take me some time. I'm traveling internationally to
>>> conferences and have also been focused on moving the 0.11 release
>>> forward.
>>>
>>> - Wes
>>>
>>> On Wed, Oct 3, 2018 at 12:00 PM Bipin Mathew <[email protected]> wrote:
>>> >
>>> > Good Morning Everyone,
>>> >
>>> > I originally posted this question to the dev channel, not knowing
>>> > a user channel was available.
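[Editor's note: a minimal sketch of the "size it first" approach
described above, assuming the 0.x Status-returning C++ API that this
thread uses. The function name WriteBatchInPlace is hypothetical, and
an ordinary mutable arrow::Buffer stands in here for a real shared
memory page; exact signatures may differ in later Arrow releases.]

```cpp
#include <memory>
#include <arrow/api.h>
#include <arrow/io/memory.h>
#include <arrow/ipc/api.h>

// Hypothetical sketch: compute the serialized size of a record batch,
// check it fits the pre-allocated region, then serialize directly into
// that region through a writer that cannot reallocate.
arrow::Status WriteBatchInPlace(const arrow::RecordBatch& batch,
                                std::shared_ptr<arrow::Buffer> shm_region) {
  // 1. Determine how many bytes the serialized batch will occupy.
  int64_t size = 0;
  ARROW_RETURN_NOT_OK(arrow::ipc::GetRecordBatchSize(batch, &size));
  if (size > shm_region->size()) {
    return arrow::Status::Invalid("shared memory region too small");
  }

  // 2. Wrap the region in a fixed-size writer, so every byte lands
  //    inside the region and no heap buffer is grown behind the scenes.
  arrow::io::FixedSizeBufferWriter writer(shm_region);

  // 3. Serialize the metadata and body directly into the region.
  int32_t metadata_length = 0;
  int64_t body_length = 0;
  ARROW_RETURN_NOT_OK(arrow::ipc::WriteRecordBatch(
      batch, /*buffer_start_offset=*/0, &writer, &metadata_length,
      &body_length, arrow::default_memory_pool()));
  return writer.Close();
}
```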
>>> > This channel is probably more appropriate, and I am hoping the
>>> > kind souls here can help me. How, fundamentally, are we expected
>>> > to copy, or indeed directly write, an Arrow table to shared
>>> > memory using the C++ SDK? Currently, I have an implementation
>>> > like this:
>>> >
>>> >> 77   std::shared_ptr<arrow::Buffer> B;
>>> >> 78   std::shared_ptr<arrow::io::BufferOutputStream> buffer;
>>> >> 79   std::shared_ptr<arrow::ipc::RecordBatchWriter> writer;
>>> >> 80   arrow::MemoryPool* pool = arrow::default_memory_pool();
>>> >> 81   arrow::io::BufferOutputStream::Create(4096, pool, &buffer);
>>> >> 82   std::shared_ptr<arrow::Table> table;
>>> >> 83   karrow::ArrowHandle *h;
>>> >> 84   h = (karrow::ArrowHandle *)Kj(khandle);
>>> >> 85   table = h->table;
>>> >> 86
>>> >> 87   arrow::ipc::RecordBatchStreamWriter::Open(buffer.get(), table->schema(), &writer);
>>> >> 88   writer->WriteTable(*table);
>>> >> 89   writer->Close();
>>> >> 90   buffer->Finish(&B);
>>> >> 91
>>> >> 92   // printf("Investigate Memory usage.");
>>> >> 93   // getchar();
>>> >> 94
>>> >> 95
>>> >> 96   std::shared_ptr<arrow::io::MemoryMappedFile> mm;
>>> >> 97   arrow::io::MemoryMappedFile::Create("/dev/shm/arrow_table", B->size(), &mm);
>>> >> 98   mm->Write(B->data(), B->size());
>>> >> 99   mm->Close();
>>> >
>>> > "table" on line 85 is a shared_ptr to an arrow::Table object. As
>>> > you can see there, I write to an arrow::Buffer and then write
>>> > that to a memory-mapped file. Is there a more direct approach? I
>>> > watched this video of a talk @Wes McKinney gave here:
>>> >
>>> > https://www.dremio.com/webinars/arrow-c++-roadmap-and-pandas2/
>>> >
>>> > where a method, arrow::MemoryMappedBuffer, was referenced, but I
>>> > have not seen any documentation regarding this function. Has it
>>> > been deprecated?
>>> >
>>> > Also, as I mentioned, "table" up there is an arrow::Table object.
>>> > I create it columnwise using the various arrow::[type]Builder
>>> > functions.
>>> > Is there any way to actually write the original table directly
>>> > into shared memory? Any guidance on the proper way to do these
>>> > things would be greatly appreciated.
>>> >
>>> > Regards,
>>> >
>>> > Bipin
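
[Editor's note: the write-then-zero-copy-read round trip discussed in
this thread can be sketched end to end as follows. This is a hedged
illustration, not a definitive implementation: it assumes the 0.x
Status-returning C++ API used elsewhere in the thread, the function
name RoundTrip is hypothetical, and the 4 MB size cap is an arbitrary
choice for the example.]

```cpp
#include <memory>
#include <arrow/api.h>
#include <arrow/io/file.h>
#include <arrow/ipc/api.h>

// Write a table's IPC stream straight into a memory-mapped file under
// /dev/shm, then map the same file and "read" the batch back without
// copying its buffers.
arrow::Status RoundTrip(const std::shared_ptr<arrow::Table>& table) {
  // --- Write side: stream directly into the mapped region, skipping
  // the intermediate BufferOutputStream used in the snippet above.
  std::shared_ptr<arrow::io::MemoryMappedFile> mm;
  ARROW_RETURN_NOT_OK(arrow::io::MemoryMappedFile::Create(
      "/dev/shm/arrow_table", 4 * 1024 * 1024, &mm));

  std::shared_ptr<arrow::ipc::RecordBatchWriter> writer;
  ARROW_RETURN_NOT_OK(arrow::ipc::RecordBatchStreamWriter::Open(
      mm.get(), table->schema(), &writer));
  ARROW_RETURN_NOT_OK(writer->WriteTable(*table));
  ARROW_RETURN_NOT_OK(writer->Close());
  ARROW_RETURN_NOT_OK(mm->Close());

  // --- Read side: map the file read-only; record batches produced by
  // the stream reader reference the mapped memory (zero copy).
  std::shared_ptr<arrow::io::MemoryMappedFile> ro;
  ARROW_RETURN_NOT_OK(arrow::io::MemoryMappedFile::Open(
      "/dev/shm/arrow_table", arrow::io::FileMode::READ, &ro));

  std::shared_ptr<arrow::RecordBatchReader> reader;
  ARROW_RETURN_NOT_OK(
      arrow::ipc::RecordBatchStreamReader::Open(ro.get(), &reader));

  std::shared_ptr<arrow::RecordBatch> batch;
  ARROW_RETURN_NOT_OK(reader->ReadNext(&batch));
  // batch's buffers now point into the shared memory mapping.
  return arrow::Status::OK();
}
```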
