Hello,

I am getting started and prototyping with the Arrow C++ API. Great stuff! I 
have some newbie questions.

My first goal is to write a bunch of "records" to an Arrow Table on disk. The 
resulting file can be bigger than RAM, so I want a zero-copy approach when it 
is subsequently consumed from a Python script (to_pandas(), etc.).

My prototype seems to be working. I build the table following the "Row to 
columnar conversion" example, then open a FileOutputStream, get a 
RecordBatchWriter from arrow::ipc::NewFileWriter(), and call WriteTable() on 
it to save the table to disk. Reading it back from Python gives me the 
zero-copy usage I want.
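
For context, here is a trimmed-down sketch of what the prototype does (the 
single int64 column is just a placeholder, and I'm using 
arrow::ipc::MakeFileWriter, which I understand is the newer name for 
NewFileWriter):

  #include <arrow/api.h>
  #include <arrow/io/api.h>
  #include <arrow/ipc/api.h>

  arrow::Status WriteTableToFile(const std::string& path) {
    // Build one column with an ArrayBuilder, as in the
    // "Row to columnar conversion" example (placeholder data).
    arrow::Int64Builder id_builder;
    ARROW_RETURN_NOT_OK(id_builder.AppendValues({1, 2, 3}));
    std::shared_ptr<arrow::Array> id_array;
    ARROW_RETURN_NOT_OK(id_builder.Finish(&id_array));

    auto schema = arrow::schema({arrow::field("id", arrow::int64())});
    auto table = arrow::Table::Make(schema, {id_array});

    // Open the output file and wrap it in an IPC file writer.
    ARROW_ASSIGN_OR_RAISE(auto sink, arrow::io::FileOutputStream::Open(path));
    ARROW_ASSIGN_OR_RAISE(auto writer,
                          arrow::ipc::MakeFileWriter(sink, schema));

    // Write the whole in-memory table, then close to finalize the footer.
    ARROW_RETURN_NOT_OK(writer->WriteTable(*table));
    return writer->Close();
  }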

My question is: how do I build that table on disk in a scalable way, e.g. 
when there are billions of records? Do the ArrayBuilder classes already 
handle spilling to disk once they outgrow available RAM? Do I need to point 
them at a MemoryMappedFile up front, or use a MemoryManager, to make this 
scale? Maybe I am done already, but it seems like I'll need to break the 
content I feed to the ArrayBuilders into smaller batches, while still ending 
up with only one chunk per ChunkedArray to stay zero-copy compatible.
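
To make the question concrete, here is the batch-at-a-time pattern I have in 
mind (a rough sketch only; the batch size and the single int64 column are 
made-up placeholders). Is this the intended approach, or does ending up with 
many record batches (and hence many chunks when the file is read back as a 
table) defeat the zero-copy goal?

  #include <algorithm>
  #include <arrow/api.h>
  #include <arrow/io/api.h>
  #include <arrow/ipc/api.h>

  arrow::Status WriteInBatches(const std::string& path, int64_t num_rows) {
    auto schema = arrow::schema({arrow::field("id", arrow::int64())});

    ARROW_ASSIGN_OR_RAISE(auto sink, arrow::io::FileOutputStream::Open(path));
    ARROW_ASSIGN_OR_RAISE(auto writer,
                          arrow::ipc::MakeFileWriter(sink, schema));

    const int64_t batch_size = 1 << 20;  // made-up batch size
    arrow::Int64Builder id_builder;
    for (int64_t start = 0; start < num_rows; start += batch_size) {
      const int64_t n = std::min(batch_size, num_rows - start);
      // Fill the builder with one batch worth of rows so that only
      // ~batch_size rows are ever held in RAM at a time.
      for (int64_t i = 0; i < n; ++i) {
        ARROW_RETURN_NOT_OK(id_builder.Append(start + i));
      }
      std::shared_ptr<arrow::Array> id_array;
      // Finish() also resets the builder for the next batch.
      ARROW_RETURN_NOT_OK(id_builder.Finish(&id_array));

      auto batch = arrow::RecordBatch::Make(schema, n, {id_array});
      ARROW_RETURN_NOT_OK(writer->WriteRecordBatch(*batch));
    }
    return writer->Close();
  }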

Please help me understand, conceptually, the best way to write out a 
bigger-than-RAM Arrow table. A modified "Row to columnar conversion" example 
would be great as well, if applicable.

Thank you in advance for your help!

Cheers,
Jeff
