> I found that the buffers read in C++ are
> [nullptr, 500 numbers, nullptr, 500 numbers]; with chunk_size = 10 I get
> [nullptr, 40 numbers, nullptr, 40 numbers, ...]. This confuses me: why is
> there a seemingly useless nullptr Buffer before every data Buffer?
The buffers in ArrayData reflect the Arrow columnar layout. The nullptr
stands in for the validity bitmap, which is elided when an array contains
no null values.

Regarding pre-allocation, this has been discussed before, but no one has
contributed an implementation yet. The last conversation was [1]. It
doesn't mention memory mapping, but I think that could potentially fit in
with the right abstractions.

[1] https://www.mail-archive.com/[email protected]/msg19862.html

On Sun, Aug 2, 2020 at 6:42 PM comic fans <[email protected]> wrote:

> Hello everyone, I'm trying to write out a data frame in Feather format
> from R and read it in C++.
>
> My R code looks like this:
>
>     arrow::write_feather(data.frame(a = 1:1000, b = 1000:1),
>                          'arrow.data', chunk_size = 500,
>                          compression = 'uncompressed')
>
> and my C++ code looks like this:
>
>     auto column0 = table->column(0);
>     for (int i = 0; i < column0->num_chunks(); ++i) {
>       auto array = column0->chunk(i);
>       auto buffers = array->data()->buffers;
>       for (size_t j = 0; j < buffers.size(); ++j) {
>         if (!buffers[j]) {
>           std::cout << j << " null" << std::endl;
>         } else {
>           std::cout << j << " " << buffers[j]->size() << std::endl;
>         }
>       }
>     }
>
> I found that the buffers read in C++ are
> [nullptr, 500 numbers, nullptr, 500 numbers]; with chunk_size = 10 I get
> [nullptr, 40 numbers, nullptr, 40 numbers, ...]. This confuses me: why is
> there a seemingly useless nullptr Buffer before every data Buffer?
>
> Another question is how to use Arrow as a zero-copy TSDB. My requirements:
>
> 1. Historic and newly written data must be in contiguous memory and
>    cannot be chunked (so I can't put the historic read-only part and the
>    newly writable part in different buffers).
> 2. Historic data may be very large, so I need it memory mapped.
> 3. I also want to use a memory map to persist newly written data (I have
>    no strict transaction requirements; OS-scheduled flush is fine for me).
> 4. The amount of new data to write is known in advance, so a
>    preallocated memory-mapped file is OK.
> 5. All components live in the same process, so no cross-process
>    communication is needed (hence Apache Plasma is not needed).
> 6. Easy data exchange with R.
>
> At first I thought Arrow was a good fit, but after reading the docs I
> realized that Arrow buffers can't be modified: if I preallocate a Feather
> file with the array sizes I need, all of the data becomes read-only when
> I reload it through the memory-mapped file interface. I abused Arrow by
> const_cast-ing the data pointer and writing into it; since the file is
> memory mapped, the modification does change it as I intend, but I'd like
> to know if there is a better way to achieve my goal. Does Arrow intend to
> support such a use case and I missed some API?
> Any advice would be helpful.
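To make the buffer layout concrete, here is a minimal, self-contained
sketch (not taken from the thread; it assumes the Arrow C++ headers and
library are installed). It builds one Int32 array without nulls and one
with a null, then prints each slot of ArrayData::buffers. The
validity-bitmap slot is nullptr whenever null_count is zero, which is the
[nullptr, N values] pattern observed above.

    #include <iostream>
    #include <memory>

    #include <arrow/api.h>

    // Build two Int32 arrays and inspect their ArrayData buffers.
    // Buffer slot 0 is the validity bitmap; Arrow leaves it as nullptr
    // when the array contains no nulls. Slot 1 is the values buffer.
    arrow::Status RunExample() {
      arrow::Int32Builder builder;

      // Array with no nulls: validity bitmap is never allocated.
      ARROW_RETURN_NOT_OK(builder.AppendValues({1, 2, 3, 4}));
      std::shared_ptr<arrow::Array> no_nulls;
      ARROW_RETURN_NOT_OK(builder.Finish(&no_nulls));

      // Array with a null: a validity bitmap buffer is allocated.
      ARROW_RETURN_NOT_OK(builder.AppendValues({1, 2}));
      ARROW_RETURN_NOT_OK(builder.AppendNull());
      std::shared_ptr<arrow::Array> with_null;
      ARROW_RETURN_NOT_OK(builder.Finish(&with_null));

      for (const auto& array : {no_nulls, with_null}) {
        std::cout << "null_count = " << array->null_count() << std::endl;
        const auto& buffers = array->data()->buffers;
        for (size_t j = 0; j < buffers.size(); ++j) {
          if (!buffers[j]) {
            // Elided validity bitmap: the array has no nulls.
            std::cout << "  buffer " << j << ": nullptr" << std::endl;
          } else {
            std::cout << "  buffer " << j << ": " << buffers[j]->size()
                      << " bytes" << std::endl;
          }
        }
      }
      return arrow::Status::OK();
    }

    int main() {
      arrow::Status st = RunExample();
      if (!st.ok()) {
        std::cerr << st.ToString() << std::endl;
        return 1;
      }
      return 0;
    }

Compile with something like
g++ -std=c++17 example.cc $(pkg-config --cflags --libs arrow). The values
buffer is always present; only the validity bitmap is elided when it is
not needed.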
