Hi Weston,

Thanks for the detailed reply! I will explore the example you provided.
> - You should block until sufficient data is available.

Can you please elaborate on how to block in a RecordBatchReader subclass? Will
downstream nodes only pull when Flush(...) is called?

```
std::shared_ptr<arrow::RecordBatch> batch;
ARROW_ASSIGN_OR_RAISE(batch, batch_builder->Flush());
```

Thanks in advance.

Regards,
Haocheng

On Fri, May 19, 2023 at 9:03 AM Weston Pace <[email protected]> wrote:

> > Can anyone guide what's the best practice here and if my below
> > understandings are correct:
>
> Here's an example:
> https://gist.github.com/westonpace/6f7fdbdc0399501418101851d75091c4
>
> > I receive streaming data via a callback function which gives me data row
> > by row. To my best knowledge, subclassing RecordBatchReader is preferred?
>
> RecordBatchReader is pull-based. A friendly push-based source node for
> Acero would be a good idea, but doesn't exist. You can make one using a
> PushGenerator but that might be complicated. In the meantime, subclassing
> RecordBatchReader is probably simplest.
>
> > Should I batch a fixed number of rows in some in-memory data structure
> > first, then flush them to Acero?
>
> Yes, for performance.
>
> > Then how could Acero know it's time to push data in the ReadNext
> > <https://arrow.apache.org/docs/cpp/api/table.html#_CPPv4N5arrow17RecordBatchReader8ReadNextEPNSt10shared_ptrI11RecordBatchEE>
> > function?
>
> ReadNext is pull-based. Acero will continually call ReadNext. You should
> block until sufficient data is available.
>
> > I have a question on how to use the Acero push model to write streaming
> > data as hive-partitioned Parquet in a single-threaded program
>
> The example I gave will use one thread in addition to the main thread.
> Doing something without any additional threads at all would be possible,
> but would probably require knowing more about the structure of your
> program.
> On Thu, May 18, 2023 at 3:27 PM Haocheng Liu <[email protected]> wrote:
>
>> Hi,
>>
>> I have a question on how to use the Acero push model to write streaming
>> data as hive-partitioned Parquet in a single-threaded program. Can anyone
>> guide what's the best practice here and if my below understandings are
>> correct:
>>
>> - I receive streaming data via a callback function which gives me data
>>   row by row. To my best knowledge, subclassing RecordBatchReader is
>>   preferred?
>> - Should I batch a fixed number of rows in some in-memory data structure
>>   first, then flush them to Acero? Then how could Acero know it's time to
>>   push data in the ReadNext
>>   <https://arrow.apache.org/docs/cpp/api/table.html#_CPPv4N5arrow17RecordBatchReader8ReadNextEPNSt10shared_ptrI11RecordBatchEE>
>>   function?
>>
>> I'm not clear on how to connect a callback function from streaming data
>> with the Acero push model. Any suggestions will be appreciated.
>>
>> Thanks.
>>
>> Best,
>> Haocheng
>
> --
> Best regards
