Thanks Austin & Chad, but my use case is to use Beam for ETL workflow control, which seems different from your cases. I would like to check whether anyone has used Beam for this kind of use case and whether Beam is a good choice for it.
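
To make the question concrete, below is a rough sketch of the kind of pipeline I have in mind: watch for upstream files, call an external service to run a job for each new file, then notify the downstream workflow. The bucket path, service endpoints, and job-runner API are made-up placeholders, and I haven't verified this on the Flink or Spark runners -- it is just meant to show the shape of the DAG, with only small control messages (not the actual ETL data) flowing through it.

# Sketch only: endpoints and paths below are hypothetical placeholders.
import logging

import requests  # assumed to be available on the workers
import apache_beam as beam
from apache_beam.io.fileio import MatchContinuously
from apache_beam.options.pipeline_options import PipelineOptions


class RunExternalJob(beam.DoFn):
    """Step 2: call some external service to run a job for each new file."""
    def process(self, file_metadata):
        # Hypothetical REST endpoint that launches a notebook / python script.
        resp = requests.post(
            "https://job-runner.example.com/run",
            json={"input": file_metadata.path},
        )
        resp.raise_for_status()
        yield {"input": file_metadata.path, "job_id": resp.json()["job_id"]}


class NotifyDownstream(beam.DoFn):
    """Step 3: tell the downstream workflow that the job was launched."""
    def process(self, result):
        # Hypothetical notification endpoint; could also be a Pub/Sub or Kafka sink.
        requests.post("https://downstream.example.com/notify", json=result)
        logging.info("notified downstream for %s", result["input"])
        yield result


options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    (
        p
        # Step 1: poll the upstream location for new files every 60 seconds
        # (assumes the matching filesystem connector is installed).
        | "WatchUpstream" >> MatchContinuously("gs://upstream-bucket/landing/*", interval=60)
        | "RunJob" >> beam.ParDo(RunExternalJob())
        | "Notify" >> beam.ParDo(NotifyDownstream())
    )
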
On Sat, Dec 23, 2023 at 12:58 AM Chad Dombrova <[email protected]> wrote:
> Hi,
> I'm the guy who gave the Movie Magic talk. Since it's possible to write
> stateful transforms with Beam, it is capable of some very sophisticated
> flow control. I've not seen a Python framework that combines this with
> streaming data nearly as well. That said, there aren't a lot of great
> working examples out there for transforms that do sophisticated flow
> control, and I feel like we're always wrestling with differences in
> behavior between the direct runner and Dataflow. There was a thread about
> polling patterns [1] on this list that never really got a satisfying
> resolution. Likewise, there was a thread about using an SDF with an
> unbounded source [2] that also didn't get fully resolved.
>
> [1] https://lists.apache.org/thread/nsxs49vjokcc5wkvdvbvsqwzq682s7qw
> [2] https://lists.apache.org/thread/n3xgml0z8fok7101q79rsmdgp06lofnb
>
> On Sun, Dec 17, 2023 at 3:53 PM Austin Bennett <[email protected]> wrote:
>
>> https://beamsummit.org/sessions/event-driven-movie-magic/
>>
>> ^^ The question made me think of that use case, though it's unclear how
>> close it is to what you're thinking about.
>>
>> Cheers -
>>
>> On Fri, Dec 15, 2023 at 7:01 AM Byron Ellis via user <[email protected]> wrote:
>>
>>> As Jan says, theoretically possible? Sure. That particular set of
>>> operations? Overkill. If you don't have it already set up, I'd say even
>>> something like Airflow is overkill here. If all you need to do is "launch
>>> a job and wait" when a file arrives... that's a small script, not
>>> something that particularly requires a distributed data processing system.
>>>
>>> On Fri, Dec 15, 2023 at 4:58 AM Jan Lukavský <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> Apache Beam describes itself as "an open-source, unified programming
>>>> model for batch and streaming data processing pipelines, ...". As such,
>>>> it is possible to use it to express essentially arbitrary logic and run
>>>> it as a streaming pipeline. A streaming pipeline processes input data
>>>> and produces output data and/or actions. Given these assumptions, it is
>>>> technically feasible to use Apache Beam for orchestrating other
>>>> workflows; the problem is that it will very likely not be efficient.
>>>> Apache Beam does a lot of heavy lifting related to the fact that it is
>>>> designed to process large volumes of data in a scalable way, which is
>>>> probably not what one needs for workflow orchestration. So my two cents
>>>> would be that although it _could_ be done, it probably _should not_ be
>>>> done.
>>>>
>>>> Best,
>>>>
>>>> Jan
>>>>
>>>> On 12/15/23 13:39, Mikhail Khludnev wrote:
>>>>
>>>> Hello,
>>>> I think this page
>>>> https://beam.apache.org/documentation/ml/orchestration/ might answer
>>>> your question.
>>>> Frankly speaking: GCP Workflows and Apache Airflow.
>>>> But Beam itself is a data-stream/flow or batch processor, not a
>>>> workflow engine (IMHO).
>>>>
>>>> On Fri, Dec 15, 2023 at 3:13 PM data_nerd_666 <[email protected]> wrote:
>>>>
>>>>> I know it is technically possible, but my case may be a little
>>>>> special. Say I have 3 steps for my control flow (ETL workflow):
>>>>> Step 1. upstream file watching
>>>>> Step 2. call some external service to run one job, e.g. run a
>>>>> notebook or run a Python script
>>>>> Step 3. notify the downstream workflow
>>>>> Can I use Apache Beam to build a DAG with 3 nodes and run this as
>>>>> either a Flink or Spark job? It might be a little weird, but I just
>>>>> want to learn from the community whether this is the right way to use
>>>>> Apache Beam, and whether anyone has done this before. Thanks
>>>>>
>>>>> On Fri, Dec 15, 2023 at 10:28 AM Byron Ellis via user <[email protected]> wrote:
>>>>>
>>>>>> It's technically possible, but the closest thing I can think of would
>>>>>> be triggering things based on something like file watching.
>>>>>>
>>>>>> On Thu, Dec 14, 2023 at 2:46 PM data_nerd_666 <[email protected]> wrote:
>>>>>>
>>>>>>> I'm not using Beam as a time-based scheduler, just using it to
>>>>>>> control the execution order of an ETL workflow DAG, because Beam's
>>>>>>> abstraction is also a DAG.
>>>>>>> I know it is a little weird; I just want to confirm with the
>>>>>>> community whether anyone has used Beam like this before.
>>>>>>>
>>>>>>> On Thu, Dec 14, 2023 at 10:59 PM Jan Lukavský <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> can you give an example of what you mean for better understanding?
>>>>>>>> Do you mean using Beam as a scheduler of other ETL workflows?
>>>>>>>>
>>>>>>>> Jan
>>>>>>>>
>>>>>>>> On 12/14/23 13:17, data_nerd_666 wrote:
>>>>>>>> > Hi all,
>>>>>>>> >
>>>>>>>> > I am new to Apache Beam, and am very excited to find Beam in the
>>>>>>>> > Apache community. I see lots of use cases of using Apache Beam
>>>>>>>> > for data flow (processing large amounts of batch/streaming data).
>>>>>>>> > I am just wondering whether I can use Apache Beam for control
>>>>>>>> > flow (ETL workflow). I don't mean the Spark/Flink job in the ETL
>>>>>>>> > workflow, I mean the ETL workflow itself. An ETL workflow is also
>>>>>>>> > a DAG, which is very similar to the abstraction of Apache Beam,
>>>>>>>> > but unfortunately I didn't find such use cases on the internet.
>>>>>>>> > So I'd like to ask this question in the Beam community to confirm
>>>>>>>> > whether I can use Apache Beam for control flow (ETL workflow). If
>>>>>>>> > yes, please let me know some success stories of this. Thanks
>>>>
>>>> --
>>>> Sincerely yours
>>>> Mikhail Khludnev
