Hi, I have a question about how to use the Acero push model to write streaming data as Hive-partitioned Parquet in a single-threaded program. Can anyone advise on the best practice here, and whether my understanding below is correct:
- I receive streaming data via a callback function that hands me data row by row. To my best knowledge, subclassing RecordBatchReader is the preferred approach?
- Should I first buffer a fixed number of rows in some in-memory data structure, then flush them to Acero? If so, how does Acero know it's time to push data in the ReadNext <https://arrow.apache.org/docs/cpp/api/table.html#_CPPv4N5arrow17RecordBatchReader8ReadNextEPNSt10shared_ptrI11RecordBatchEE> function? I'm not clear on how to connect a callback function from streaming data with the Acero push model.

Any suggestions would be appreciated. A rough sketch of what I currently have in mind is below; corrections welcome.
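To make the question concrete, here is a minimal sketch of my current understanding. It actually sidesteps the push model by inverting the callback into a blocking pull: the Row struct, NextRow adapter, schema, batch size, and output path are all hypothetical stand-ins for my real source, not anything from the Arrow docs. ReadNext buffers rows until a fixed batch size is reached, so the "when to flush" decision lives entirely inside the reader, and the dataset writer then pulls batches and writes them out Hive-partitioned:

#include <arrow/api.h>
#include <arrow/dataset/api.h>
#include <arrow/filesystem/localfs.h>

#include <iostream>
#include <memory>
#include <optional>
#include <string>
#include <utility>

struct Row {
  int32_t year;         // partition column
  std::string payload;  // everything else, collapsed for the sketch
};

// Hypothetical adapter over the row-by-row callback source; std::nullopt
// signals end of stream. Stubbed out here so the sketch compiles.
std::optional<Row> NextRow() { return std::nullopt; }

// RecordBatchReader that accumulates a fixed number of rows per batch.
// The consumer pulls via ReadNext, so the flush policy lives here: emit
// a batch once batch_size rows are buffered, or when the stream ends.
class StreamingReader : public arrow::RecordBatchReader {
 public:
  StreamingReader(std::shared_ptr<arrow::Schema> schema, int64_t batch_size)
      : schema_(std::move(schema)), batch_size_(batch_size) {}

  std::shared_ptr<arrow::Schema> schema() const override { return schema_; }

  arrow::Status ReadNext(std::shared_ptr<arrow::RecordBatch>* out) override {
    arrow::Int32Builder year_builder;
    arrow::StringBuilder payload_builder;
    int64_t n = 0;
    while (n < batch_size_) {
      std::optional<Row> row = NextRow();
      if (!row) break;  // end of stream
      ARROW_RETURN_NOT_OK(year_builder.Append(row->year));
      ARROW_RETURN_NOT_OK(payload_builder.Append(row->payload));
      ++n;
    }
    if (n == 0) {
      *out = nullptr;  // null batch signals end of stream to the consumer
      return arrow::Status::OK();
    }
    std::shared_ptr<arrow::Array> year, payload;
    ARROW_RETURN_NOT_OK(year_builder.Finish(&year));
    ARROW_RETURN_NOT_OK(payload_builder.Finish(&payload));
    *out = arrow::RecordBatch::Make(schema_, n, {year, payload});
    return arrow::Status::OK();
  }

 private:
  std::shared_ptr<arrow::Schema> schema_;
  int64_t batch_size_;
};

arrow::Status WriteHivePartitioned() {
  auto schema = arrow::schema({arrow::field("year", arrow::int32()),
                               arrow::field("payload", arrow::utf8())});
  auto reader = std::make_shared<StreamingReader>(schema, /*batch_size=*/1024);

  // Wrap the reader in a scanner and write it out Hive-partitioned on "year".
  auto scanner_builder =
      arrow::dataset::ScannerBuilder::FromRecordBatchReader(reader);
  ARROW_ASSIGN_OR_RAISE(auto scanner, scanner_builder->Finish());

  auto format = std::make_shared<arrow::dataset::ParquetFileFormat>();
  arrow::dataset::FileSystemDatasetWriteOptions write_options;
  write_options.file_write_options = format->DefaultWriteOptions();
  write_options.filesystem = std::make_shared<arrow::fs::LocalFileSystem>();
  write_options.base_dir = "/tmp/out";  // made-up output directory
  write_options.partitioning = std::make_shared<arrow::dataset::HivePartitioning>(
      arrow::schema({arrow::field("year", arrow::int32())}));
  write_options.basename_template = "part-{i}.parquet";

  return arrow::dataset::FileSystemDataset::Write(write_options, scanner);
}

int main() {
  arrow::Status st = WriteHivePartitioned();
  if (!st.ok()) std::cerr << st.ToString() << std::endl;
  return st.ok() ? 0 : 1;
}

This only works if I can make the callback source look like a blocking NextRow call. If there is a more idiomatic way to hook a true push-style callback directly into an Acero source node instead of wrapping it as a reader, I'd love to hear it.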
Thanks.

Best,
Haocheng