Specifying dataflow template location with Apache beam Python SDK

2023-12-17 Thread Sumit Desai via user
I am creating an Apache Beam pipeline using the Python SDK. I want to use one of
the standard Dataflow templates (this one).
But when I specify it using the 'template_location' key while creating the
pipeline_options object, I get an error `FileNotFoundError: [Errno 2] No such
file or directory: 'gcr.io/dataflow-templates-base/python310-template-launcher-base'`

I also tried specifying the complete version
`gcr.io/dataflow-templates-base/python310-template-launcher-base::flex_templates_base_image_release_20231127_RC00`
but got the same error. Can someone suggest what I might be doing wrong?
The code snippet that creates pipeline_options is as follows:

def __create_pipeline_options_dataflow(job_name):
    # Set up the Dataflow runner options
    gcp_project_id = os.environ.get(GCP_PROJECT_ID)
    # TODO: Move to environment variables
    pipeline_options = {
        'project': gcp_project_id,
        'region': "us-east1",
        'job_name': job_name,  # Provide a unique job name
        'temp_location': f'gs://{TAS_GCS_BUCKET_NAME_PREFIX}{os.getenv("UP_PLATFORM_ENV")}/temp',
        'staging_location': f'gs://{TAS_GCS_BUCKET_NAME_PREFIX}{os.getenv("UP_PLATFORM_ENV")}/staging',
        'runner': 'DataflowRunner',
        'save_main_session': True,
        'service_account_email': service_account,
        # 'network': f'projects/{gcp_project_id}/global/networks/default',
        # 'subnetwork': f'projects/{gcp_project_id}/regions/us-east1/subnetworks/default',
        'template_location': 'gcr.io/dataflow-templates-base/python310-template-launcher-base'
    }
    logger.debug(f"pipeline_options created as {pipeline_options}")
    return pipeline_options
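
For reference, `template_location` in the Beam Python SDK's Dataflow options normally
expects a local or GCS path where the generated (classic) template spec is written,
not a container image; the gcr.io launcher image is the kind of thing referenced when
building a Flex Template (e.g. with `gcloud dataflow flex-template build`). A minimal
sketch, with placeholder project and bucket names:

from apache_beam.options.pipeline_options import PipelineOptions

# Sketch only: 'template_location' points at a GCS path to write the template
# spec to, rather than a gcr.io image reference.
options = PipelineOptions.from_dictionary({
    'project': 'my-project',                       # placeholder project id
    'region': 'us-east1',
    'runner': 'DataflowRunner',
    'temp_location': 'gs://my-bucket/temp',        # placeholder bucket
    'staging_location': 'gs://my-bucket/staging',
    'template_location': 'gs://my-bucket/templates/my_template',
})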


Re: Can apache beam be used for control flow (ETL workflow)

2023-12-17 Thread Austin Bennett
https://beamsummit.org/sessions/event-driven-movie-magic/

^^ The question made me think of that use case, though it's unclear how close
it is to what you're thinking about.

Cheers -

On Fri, Dec 15, 2023 at 7:01 AM Byron Ellis via user 
wrote:

> As Jan says, theoretically possible? Sure. That particular set of
> operations? Overkill. If you don't have it already set up I'd say even
> something like Airflow is overkill here. If all you need to do is "launch
> job and wait" when a file arrives... that's a small script and not
> something that particularly requires a distributed data processing system.
>
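
A minimal sketch of the kind of small "launch job and wait" script mentioned above
(not from the thread): poll a GCS prefix and run a placeholder launch command when a
file appears. The bucket, prefix, and command are assumptions.

import subprocess
import time

from google.cloud import storage


def wait_for_file_and_launch(bucket_name, prefix):
    """Poll a GCS prefix; when a file shows up, launch a job and wait for it."""
    client = storage.Client()
    while True:
        blobs = list(client.list_blobs(bucket_name, prefix=prefix, max_results=1))
        if blobs:
            # Placeholder launch command; check=True blocks until the job
            # finishes and raises if it fails.
            subprocess.run(["bash", "run_my_job.sh", blobs[0].name], check=True)
            return
        time.sleep(60)  # poll once a minute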
> On Fri, Dec 15, 2023 at 4:58 AM Jan Lukavský  wrote:
>
>> Hi,
>>
>> Apache Beam describes itself as "Apache Beam is an open-source, unified
>> programming model for batch and streaming data processing pipelines, ...".
>> As such, it is possible to use it to express essentially arbitrary logic
>> and run it as a streaming pipeline. A streaming pipeline processes input
>> data and produces output data and/or actions. Given these assumptions, it
>> is technically feasible to use Apache Beam for orchestrating other
>> workflows; the problem is that it will very likely not be efficient.
>> Apache Beam does a lot of heavy lifting related to the fact that it is designed
>> to process large volumes of data in a scalable way, which is probably not
>> what one would need for workflow orchestration. So, my two cents would be
>> that although it _could_ be done, it probably _should not_ be done.
>>
>> Best,
>>
>>  Jan
>> On 12/15/23 13:39, Mikhail Khludnev wrote:
>>
>> Hello,
>> I think this page https://beam.apache.org/documentation/ml/orchestration/
>> might answer your question.
>> Frankly speaking: GCP Workflows and Apache Airflow.
>> But Beam itself is a data-stream/flow or batch processor; not a workflow
>> engine (IMHO).
>>
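
As one illustration of the Airflow route mentioned above, a hedged sketch of an
Airflow DAG that launches a Beam pipeline, assuming the
apache-airflow-providers-apache-beam package is installed; the DAG id, file path,
and bucket names are placeholders.

from datetime import datetime

from airflow import DAG
from airflow.providers.apache.beam.operators.beam import BeamRunPythonPipelineOperator

# Sketch: Airflow handles the orchestration, Beam handles the data processing.
with DAG(
    dag_id="etl_workflow",             # placeholder DAG id
    start_date=datetime(2023, 12, 1),
    schedule_interval=None,            # triggered manually or by an external event
) as dag:
    run_beam_job = BeamRunPythonPipelineOperator(
        task_id="run_beam_job",
        py_file="gs://my-bucket/pipelines/my_pipeline.py",  # placeholder path
        runner="DataflowRunner",
        pipeline_options={"temp_location": "gs://my-bucket/temp"},
    )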
>> On Fri, Dec 15, 2023 at 3:13 PM data_nerd_666 
>> wrote:
>>
>>> I know it is technically possible, but my case may be a little special.
>>> Say I have 3 steps for my control flow (ETL workflow):
>>> Step 1. upstream file watching
>>> Step 2. call some external service to run one job, e.g. run a notebook or
>>> a Python script
>>> Step 3. notify downstream workflow
>>> Can I use Apache Beam to build a DAG with these 3 nodes and run it as
>>> either a Flink or a Spark job? It might be a little weird, but I just want to
>>> learn from the community whether this is the right way to use Apache Beam,
>>> and whether anyone has done this before. Thanks
>>>
>>>
>>>
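
For illustration only, a rough sketch of what those 3 steps could look like as a
Beam pipeline DAG; the helper functions and file pattern are placeholders, and as
the replies above suggest, a workflow engine is usually the better fit for this.

import apache_beam as beam
from apache_beam.io import fileio


def run_external_job(path):
    # Placeholder: call some external service to run a job for this file.
    return path


def notify_downstream(result):
    # Placeholder: notify a downstream workflow that the job finished.
    return result


with beam.Pipeline() as p:
    (p
     # Step 1: watch for new files matching a placeholder pattern.
     | "WatchFiles" >> fileio.MatchContinuously("gs://my-bucket/incoming/*", interval=60)
     # Step 2: call an external service per matched file.
     | "RunExternalJob" >> beam.Map(lambda m: run_external_job(m.path))
     # Step 3: notify the downstream workflow.
     | "NotifyDownstream" >> beam.Map(notify_downstream))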
>>> On Fri, Dec 15, 2023 at 10:28 AM Byron Ellis via user <
>>> user@beam.apache.org> wrote:
>>>
 It’s technically possible but the closest thing I can think of would be
 triggering things based on something like file watching.

 On Thu, Dec 14, 2023 at 2:46 PM data_nerd_666 
 wrote:

> Not using Beam as a time-based scheduler, but just using it to control the
> execution order of an ETL workflow DAG, because Beam's abstraction is also a
> DAG.
> I know it is a little weird; I just want to confirm with the community:
> has anyone used Beam like this before?
>
>
>
> On Thu, Dec 14, 2023 at 10:59 PM Jan Lukavský  wrote:
>
>> Hi,
>>
>> can you give an example of what you mean for better understanding? Do
>> you mean using Beam as a scheduler of other ETL workflows?
>>
>>   Jan
>>
>> On 12/14/23 13:17, data_nerd_666 wrote:
>> > Hi all,
>> >
>> > I am new to Apache Beam, and am very excited to find Beam in the Apache
>> > community. I see lots of use cases of using Apache Beam for data flow
>> > (processing large amounts of batch/streaming data). I am just wondering
>> > whether I can use Apache Beam for control flow (ETL workflow). I don't
>> > mean the Spark/Flink job in the ETL workflow, I mean the ETL workflow
>> > itself. An ETL workflow is also a DAG, which is very similar to
>> > the abstraction of Apache Beam, but unfortunately I didn't find such
>> > use cases on the internet. So I'd like to ask this question in the Beam
>> > community to confirm whether I can use Apache Beam for control flow
>> > (ETL workflow). If yes, please let me know about some success stories.
>> > Thanks
>> >
>> >
>> >
>>
>
>>
>> --
>> Sincerely yours
>> Mikhail Khludnev
>>
>>