Thanks Austin & Chad, but my use case is using Beam for ETL workflow
control, which seems different from yours. I would like to check whether
anyone has used Beam for this kind of use case and whether Beam is a good
choice for it.

On Sat, Dec 23, 2023 at 12:58 AM Chad Dombrova <[email protected]> wrote:

> Hi,
> I'm the guy who gave the Movie Magic talk. Since it's possible to write
> stateful transforms with Beam, it is capable of some very sophisticated
> flow control. I've not seen a Python framework that combines this with
> streaming data nearly as well. That said, there aren't a lot of great
> working examples out there for transforms that do sophisticated flow
> control, and I feel like we're always wrestling with differences in
> behavior between the direct runner and Dataflow. There was a thread about
> polling patterns [1] on this list that never really got a satisfying
> resolution. Likewise, there was a thread about using an SDF with an
> unbounded source [2] that also didn't get fully resolved.
>
> [1] https://lists.apache.org/thread/nsxs49vjokcc5wkvdvbvsqwzq682s7qw
> [2] https://lists.apache.org/thread/n3xgml0z8fok7101q79rsmdgp06lofnb
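>
> To make "stateful transforms" concrete, here is a minimal sketch of the
> stateful DoFn API in the Python SDK (the names are made up for
> illustration; this is not the Movie Magic code, just the shape of the
> API):
>
>     import apache_beam as beam
>     from apache_beam.coders import VarIntCoder
>     from apache_beam.transforms.userstate import ReadModifyWriteStateSpec
>
>     class CountPerKey(beam.DoFn):
>         # Per-key state; this is the mechanism that enables flow control.
>         COUNT = ReadModifyWriteStateSpec('count', VarIntCoder())
>
>         def process(self, element, count=beam.DoFn.StateParam(COUNT)):
>             key, _ = element  # stateful DoFns require keyed input
>             current = (count.read() or 0) + 1
>             count.write(current)
>             yield key, current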
>
>
>
> On Sun, Dec 17, 2023 at 3:53 PM Austin Bennett <[email protected]> wrote:
>
>> https://beamsummit.org/sessions/event-driven-movie-magic/
>>
>> ^^ the question made me think of that use case. Though it's unclear how
>> close it is to what you're thinking about.
>>
>> Cheers -
>>
>> On Fri, Dec 15, 2023 at 7:01 AM Byron Ellis via user <
>> [email protected]> wrote:
>>
>>> As Jan says, theoretically possible? Sure. That particular set of
>>> operations? Overkill. If you don't have it set up already, I'd say even
>>> something like Airflow is overkill here. If all you need to do is "launch
>>> a job and wait" when a file arrives... that's a small script, not
>>> something that particularly requires a distributed data processing system.
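>>>
>>> For illustration, the "small script" version could be as simple as the
>>> sketch below (the trigger path and launch command are hypothetical
>>> placeholders):
>>>
>>>     import subprocess
>>>     import time
>>>     from pathlib import Path
>>>
>>>     TRIGGER = Path("/data/incoming/ready.flag")  # hypothetical trigger file
>>>
>>>     # Poll until the upstream file appears, then launch the job and wait.
>>>     while not TRIGGER.exists():
>>>         time.sleep(60)
>>>     subprocess.run(["python", "run_etl_job.py"], check=True)  # placeholder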
>>>
>>> On Fri, Dec 15, 2023 at 4:58 AM Jan Lukavský <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> Apache Beam describes itself as "an open-source, unified programming
>>>> model for batch and streaming data processing pipelines, ...". As such,
>>>> it is possible to use it to express essentially arbitrary logic and run
>>>> it as a streaming pipeline. A streaming pipeline processes input data
>>>> and produces output data and/or actions. Given these assumptions, it is
>>>> technically feasible to use Apache Beam for orchestrating other
>>>> workflows; the problem is that it will very likely not be efficient.
>>>> Apache Beam does a lot of heavy lifting because it is designed to
>>>> process large volumes of data in a scalable way, which is probably not
>>>> what one needs for workflow orchestration. So, my two cents: although
>>>> it _could_ be done, it probably _should not_ be done.
>>>>
>>>> Best,
>>>>
>>>>  Jan
>>>> On 12/15/23 13:39, Mikhail Khludnev wrote:
>>>>
>>>> Hello,
>>>> I think this page,
>>>> https://beam.apache.org/documentation/ml/orchestration/, might answer
>>>> your question. Frankly speaking, the usual tools for this are GCP
>>>> Workflows and Apache Airflow. Beam itself is a stream/batch data
>>>> processor, not a workflow engine (IMHO).
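>>>>
>>>> As a rough sketch of that orchestration pattern (a hedged example using
>>>> the Airflow Apache Beam provider operator; the DAG id and pipeline path
>>>> below are hypothetical placeholders), an Airflow DAG can launch a Beam
>>>> pipeline like this:
>>>>
>>>>     from datetime import datetime
>>>>
>>>>     from airflow import DAG
>>>>     from airflow.providers.apache.beam.operators.beam import (
>>>>         BeamRunPythonPipelineOperator,
>>>>     )
>>>>
>>>>     with DAG(
>>>>         "etl_workflow",                  # hypothetical DAG id
>>>>         start_date=datetime(2023, 12, 1),
>>>>         schedule_interval=None,          # triggered, not time-scheduled
>>>>     ) as dag:
>>>>         run_beam = BeamRunPythonPipelineOperator(
>>>>             task_id="run_beam_pipeline",
>>>>             py_file="gs://my-bucket/pipelines/my_pipeline.py",  # placeholder
>>>>         )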
>>>>
>>>> On Fri, Dec 15, 2023 at 3:13 PM data_nerd_666 <[email protected]>
>>>> wrote:
>>>>
>>>>> I know it is technically possible, but my case may be a little
>>>>> special. Say I have 3 steps in my control flow (ETL workflow):
>>>>> Step 1. upstream file watching
>>>>> Step 2. call some external service to run a job, e.g. run a notebook
>>>>> or a Python script
>>>>> Step 3. notify the downstream workflow
>>>>> Can I use Apache Beam to build a DAG with these 3 nodes and run it as
>>>>> either a Flink or Spark job? It might be a little weird, but I just
>>>>> want to learn from the community whether this is the right way to use
>>>>> Apache Beam, and has anyone done this before? Thanks
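>>>>>
>>>>> To make the 3-node idea concrete, here is a rough sketch of what such
>>>>> a pipeline might look like in the Python SDK (the service call, the
>>>>> notification, and the file pattern are hypothetical placeholders; the
>>>>> continuous file match makes the pipeline unbounded, so it needs a
>>>>> streaming runner):
>>>>>
>>>>>     import apache_beam as beam
>>>>>     from apache_beam.io import fileio
>>>>>
>>>>>     def run_external_job(path):
>>>>>         # Placeholder: call the external service / notebook runner here.
>>>>>         return f"job-for-{path}"
>>>>>
>>>>>     def notify_downstream(job_id):
>>>>>         # Placeholder: notify the downstream workflow here.
>>>>>         return job_id
>>>>>
>>>>>     with beam.Pipeline() as p:
>>>>>         (p
>>>>>          | "WatchFiles" >> fileio.MatchContinuously(
>>>>>                "gs://my-bucket/incoming/*", interval=60)  # placeholder
>>>>>          | "RunJob" >> beam.Map(lambda m: run_external_job(m.path))
>>>>>          | "Notify" >> beam.Map(notify_downstream))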
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Dec 15, 2023 at 10:28 AM Byron Ellis via user <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> It’s technically possible, but the closest thing I can think of
>>>>>> would be triggering pipelines based on something like file watching.
>>>>>>
>>>>>> On Thu, Dec 14, 2023 at 2:46 PM data_nerd_666 <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Not using Beam as a time-based scheduler, but just using it to
>>>>>>> control the execution order of an ETL workflow DAG, because Beam's
>>>>>>> abstraction is also a DAG.
>>>>>>> I know it is a little weird; I just want to confirm with the
>>>>>>> community: has anyone used Beam like this before?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Dec 14, 2023 at 10:59 PM Jan Lukavský <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> can you give an example of what you mean for better understanding?
>>>>>>>> Do you mean using Beam as a scheduler of other ETL workflows?
>>>>>>>>
>>>>>>>>   Jan
>>>>>>>>
>>>>>>>> On 12/14/23 13:17, data_nerd_666 wrote:
>>>>>>>> > Hi all,
>>>>>>>> >
>>>>>>>> > I am new to Apache Beam, and am very excited to find Beam in the
>>>>>>>> > Apache community. I see lots of use cases of using Apache Beam for
>>>>>>>> > data flow (processing large amounts of batch/streaming data). I am
>>>>>>>> > just wondering whether I can use Apache Beam for control flow (ETL
>>>>>>>> > workflow). I don't mean the Spark/Flink jobs in the ETL workflow, I
>>>>>>>> > mean the ETL workflow itself. An ETL workflow is also a DAG, which
>>>>>>>> > is very similar to the abstraction of Apache Beam, but unfortunately
>>>>>>>> > I didn't find such use cases on the internet. So I'd like to ask
>>>>>>>> > this question in the Beam community to confirm whether I can use
>>>>>>>> > Apache Beam for control flow (ETL workflow). If yes, please let me
>>>>>>>> > know some success stories of this. Thanks
>>>>>>>> >
>>>>>>>> >
>>>>>>>> >
>>>>>>>>
>>>>>>>
>>>>
>>>> --
>>>> Sincerely yours
>>>> Mikhail Khludnev
>>>>
>>>>
