As Jan says, theoretically possible? Sure. That particular set of
operations? Overkill. If you don't already have it set up, I'd say even
something like Airflow is overkill here. If all you need to do is "launch
job and wait" when a file arrives... that's a small script, not
something that requires a distributed data processing system.
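
To make that concrete, here's a rough sketch of what I mean (the watch
directory, the job command, and the webhook URL are all made-up
placeholders; a cron job or inotify would work just as well):

    import subprocess
    import time
    from pathlib import Path

    import requests  # any HTTP client works here

    WATCH_DIR = Path("/data/incoming")       # hypothetical upstream drop dir
    NOTIFY_URL = "https://example.com/hook"  # hypothetical downstream webhook

    seen = set()
    while True:
        # Step 1: watch for newly arrived files
        for f in sorted(WATCH_DIR.glob("*")):
            if f in seen:
                continue
            # Step 2: launch the external job and wait for it to finish
            subprocess.run(["python", "run_job.py", str(f)], check=True)
            # Step 3: notify the downstream workflow
            requests.post(NOTIFY_URL, json={"file": f.name})
            seen.add(f)
        time.sleep(30)

That's the whole "orchestrator" for this particular set of operations.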

On Fri, Dec 15, 2023 at 4:58 AM Jan Lukavský <[email protected]> wrote:

> Hi,
>
> Apache Beam describes itself as "an open-source, unified programming
> model for batch and streaming data processing pipelines, ...". As such,
> it can express essentially arbitrary logic and run it as a streaming
> pipeline. A streaming pipeline processes input data and produces output
> data and/or actions. Given these assumptions, it is technically feasible
> to use Apache Beam to orchestrate other workflows; the problem is that it
> will very likely not be efficient. Apache Beam does a lot of heavy
> lifting because it is designed to process large volumes of data in a
> scalable way, which is probably not what one needs for workflow
> orchestration. So, my two cents: although it _could_ be done, it
> probably _should not_ be done.
>
> Best,
>
>  Jan
> On 12/15/23 13:39, Mikhail Khludnev wrote:
>
> Hello,
> I think this page https://beam.apache.org/documentation/ml/orchestration/
> might answer your question.
> Frankly speaking: use GCP Workflows or Apache Airflow for this.
> But Beam itself is a streaming/batch data processor, not a workflow
> engine (IMHO).
>
> On Fri, Dec 15, 2023 at 3:13 PM data_nerd_666 <[email protected]>
> wrote:
>
>> I know it is technically possible, but my case may be a little special.
>> Say I have 3 steps in my control flow (ETL workflow):
>> Step 1. watch for an upstream file
>> Step 2. call some external service to run one job, e.g. run a notebook
>> or a Python script
>> Step 3. notify the downstream workflow
>> Can I use Apache Beam to build a DAG with these 3 nodes and run it as
>> either a Flink or Spark job? It might be a little weird, but I just want
>> to learn from the community whether this is the right way to use Apache
>> Beam, and whether anyone has done this before. Thanks
>>
>>
>>
>> On Fri, Dec 15, 2023 at 10:28 AM Byron Ellis via user <
>> [email protected]> wrote:
>>
>>> It’s technically possible, but the closest thing I can think of would
>>> be triggering jobs based on something like file watching.
>>>
>>> On Thu, Dec 14, 2023 at 2:46 PM data_nerd_666 <[email protected]>
>>> wrote:
>>>
>>>> Not using Beam as a time-based scheduler, but just using it to control
>>>> the execution order of an ETL workflow DAG, because Beam's abstraction
>>>> is also a DAG.
>>>> I know it is a little weird; I just want to confirm with the community:
>>>> has anyone used Beam like this before?
>>>>
>>>>
>>>>
>>>> On Thu, Dec 14, 2023 at 10:59 PM Jan Lukavský <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> can you give an example of what you mean for better understanding? Do
>>>>> you mean using Beam as a scheduler of other ETL workflows?
>>>>>
>>>>>   Jan
>>>>>
>>>>> On 12/14/23 13:17, data_nerd_666 wrote:
>>>>> > Hi all,
>>>>> >
>>>>> > I am new to Apache Beam, and am very excited to find Beam in the
>>>>> > Apache community. I see lots of use cases of using Apache Beam for
>>>>> > data flow (processing large amounts of batch/streaming data). I am
>>>>> > just wondering whether I can use Apache Beam for control flow (an
>>>>> > ETL workflow). I don't mean the Spark/Flink job inside the ETL
>>>>> > workflow, I mean the ETL workflow itself. An ETL workflow is also a
>>>>> > DAG, which is very similar to the abstraction of Apache Beam, but
>>>>> > unfortunately I didn't find such use cases on the internet. So I'd
>>>>> > like to ask this question in the Beam community to confirm whether
>>>>> > I can use Apache Beam for control flow (ETL workflow). If yes,
>>>>> > please share some success stories. Thanks
>>>>> >
>>>>> >
>>>>> >
>>>>>
>>>>
>
> --
> Sincerely yours
> Mikhail Khludnev
>
>
