Thanks, XQ and Evan, for your suggestions. I am going to try it out.

Regards,
Sumit Desai

On Sat, Dec 23, 2023 at 12:16 AM Evan Galpin <egal...@apache.org> wrote:

> I assume from the previous messages that GCP Dataflow is being used as the
> pipeline runner. Even without Flex Templates, the v2 runner can use Docker
> containers to install all dependencies from various sources[1]. I have
> used Docker containers to solve the same problem you mention: installing a
> Python dependency from a private package repository. The process is
> roughly:
>
>
>    1. Build a Docker container from the Apache Beam base images,
>    customizing as you need[2]
>    2. Tag and push that image to Google Container Registry
>    3. When you deploy your Dataflow job, include the options
>    "--experiment=use_runner_v2
>    --worker_harness_container_image=gcr.io/my-project/my-image-name:my-image-tag"
>    (there may be other ways, but this is what I have seen working
>    first-hand); a rough launch sketch follows just below
>
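> As a rough sketch of step 3 (hedged: the project and bucket names below are
> placeholders, and this assumes you launch the job from Python code rather
> than the gcloud CLI), the same flags can be passed via PipelineOptions:
>
> from apache_beam.options.pipeline_options import PipelineOptions
>
> options = PipelineOptions(
>     runner="DataflowRunner",
>     project="my-project",                  # placeholder
>     region="us-east1",
>     temp_location="gs://my-bucket/temp",   # placeholder
>     experiments=["use_runner_v2"],
>     # newer SDKs also accept sdk_container_image for the same purpose
>     worker_harness_container_image="gcr.io/my-project/my-image-name:my-image-tag",
> )
>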
> Your Dockerfile can be as simple as:
>
> # python:<major>.<minor>-slim must match the apache/beam_python<major>.<minor>_sdk image
> FROM python:3.10-slim
>
> # Authenticate with your private Python package repo, install dependencies,
> # set env vars, COPY your pipeline code into the container, etc.
> #
> #  ...
> #
> #
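> #
> # As one hedged illustration only (the index URL variable and the env value
> # below are hypothetical), the private-repo install and env-var steps might
> # look roughly like:
> #
> #   ARG PIP_EXTRA_INDEX_URL
> #   RUN pip install --no-cache-dir --extra-index-url "${PIP_EXTRA_INDEX_URL}" uplight-telemetry
> #   ENV OTEL_SERVICE_NAME=my-service-name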
>
> # Copy files from the official SDK image, including the boot script and dependencies.
> # The Apache Beam SDK version must match the Python image's major.minor version.
> # Based on https://cloud.google.com/dataflow/docs/guides/using-custom-containers#python_1
> COPY --from=apache/beam_python3.10_sdk:2.52.0 /opt/apache/beam /opt/apache/beam
>
> # Set the entrypoint to Apache Beam SDK launcher.
> ENTRYPOINT ["/opt/apache/beam/boot"]
>
> [1]
> https://cloud.google.com/dataflow/docs/guides/using-custom-containers#python_1
> [2]
> https://cloud.google.com/dataflow/docs/guides/build-container-image#python
>
>
> On Fri, Dec 22, 2023 at 6:32 AM XQ Hu via user <user@beam.apache.org>
> wrote:
>
>> You can use the same Docker image for both the template launcher and the
>> Dataflow job. Here is one example:
>> https://github.com/google/dataflow-ml-starter/blob/main/tensorflow_gpu.flex.Dockerfile#L60
>>
>> On Fri, Dec 22, 2023 at 8:04 AM Sumit Desai <sumit.de...@uplight.com>
>> wrote:
>>
>>> Yes, I will have to try it out.
>>>
>>> Regards
>>> Sumit Desai
>>>
>>> On Fri, Dec 22, 2023 at 3:53 PM Sofia’s World <mmistr...@gmail.com>
>>> wrote:
>>>
>>>> I guess so. I am not an expert on using env variables in Dataflow
>>>> pipelines, as any config dependencies I need I pass as job input params.
>>>>
>>>> But perhaps you can configure variables in your Dockerfile (I am not an
>>>> expert in this either), as flex templates use Docker?
>>>>
>>>>
>>>> https://cloud.google.com/dataflow/docs/guides/templates/configuring-flex-templates
>>>>
>>>> hth
>>>>   Marco
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Dec 22, 2023 at 10:17 AM Sumit Desai <sumit.de...@uplight.com>
>>>> wrote:
>>>>
>>>>> We are using an external, non-public package which expects environment
>>>>> variables only. If the environment variables are not found, it will
>>>>> throw an error. We can't change the source of this package.
>>>>>
>>>>> Does this mean we will face the same problem with flex templates as well?
>>>>>
>>>>> On Fri, 22 Dec 2023, 3:39 pm Sofia’s World, <mmistr...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> The flex template will allow you to pass input params with dynamic
>>>>>> values to your Dataflow job, so you could replace the env variable with
>>>>>> that input? That is, unless you have to have env vars... but from your
>>>>>> snippets it appears you are just using them to configure one of your
>>>>>> components?
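>>>>>>
>>>>>> Something like this is what I mean (just a hedged sketch; the option and
>>>>>> DoFn names are made up, and the env var gets set on the worker in setup()):
>>>>>>
>>>>>> import os
>>>>>> import apache_beam as beam
>>>>>> from apache_beam.options.pipeline_options import PipelineOptions
>>>>>>
>>>>>> class TelemetryOptions(PipelineOptions):
>>>>>>     @classmethod
>>>>>>     def _add_argparse_args(cls, parser):
>>>>>>         parser.add_argument('--otel_service_name', default='')
>>>>>>
>>>>>> class SetOtelEnv(beam.DoFn):
>>>>>>     def __init__(self, otel_service_name):
>>>>>>         self._otel_service_name = otel_service_name
>>>>>>
>>>>>>     def setup(self):
>>>>>>         # runs on the worker before processing starts
>>>>>>         os.environ['OTEL_SERVICE_NAME'] = self._otel_service_name
>>>>>>
>>>>>>     def process(self, element):
>>>>>>         yield element
>>>>>>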
>>>>>> Hth
>>>>>>
>>>>>> On Fri, 22 Dec 2023, 10:01 Sumit Desai, <sumit.de...@uplight.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Sofia and XQ,
>>>>>>>
>>>>>>> The application is failing because I have loggers defined in every
>>>>>>> file, and the method that creates a logger tries to create an object of
>>>>>>> UplightTelemetry. If I use flex templates, will the environment
>>>>>>> variables I supply be loaded before the application gets loaded? If
>>>>>>> not, it would not serve my purpose.
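>>>>>>>
>>>>>>> To illustrate with a simplified, hypothetical stand-in (not the real
>>>>>>> package code), the pattern in each module is roughly:
>>>>>>>
>>>>>>> import logging
>>>>>>> import os
>>>>>>>
>>>>>>> def get_logger(name):
>>>>>>>     # stand-in for UplightTelemetry: the env var is read when the
>>>>>>>     # logger is created, and a missing value raises immediately
>>>>>>>     service = os.environ["OTEL_SERVICE_NAME"]  # KeyError if unset
>>>>>>>     return logging.getLogger(f"{service}.{name}")
>>>>>>>
>>>>>>> # module-level logger, so this runs at import time
>>>>>>> logger = get_logger(__name__)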
>>>>>>>
>>>>>>> Thanks & Regards,
>>>>>>> Sumit Desai
>>>>>>>
>>>>>>> On Thu, Dec 21, 2023 at 10:02 AM Sumit Desai <
>>>>>>> sumit.de...@uplight.com> wrote:
>>>>>>>
>>>>>>>> Thank you, XQ. Will take a look at this.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Sumit Desai
>>>>>>>>
>>>>>>>> On Wed, Dec 20, 2023 at 8:13 PM XQ Hu <x...@google.com> wrote:
>>>>>>>>
>>>>>>>>> Dataflow VMs cannot know your local env variables. I think you
>>>>>>>>> should use a custom container:
>>>>>>>>> https://cloud.google.com/dataflow/docs/guides/using-custom-containers.
>>>>>>>>> Here is a sample project:
>>>>>>>>> https://github.com/google/dataflow-ml-starter
>>>>>>>>>
>>>>>>>>> On Wed, Dec 20, 2023 at 4:57 AM Sofia’s World <mmistr...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hello Sumit
>>>>>>>>>>  Thanks. Sorry... I guess if the value of the env variable is
>>>>>>>>>> always the same, you can pass it as a job param? ...though it
>>>>>>>>>> doesn't sound like a viable option...
>>>>>>>>>> Hth
>>>>>>>>>>
>>>>>>>>>> On Wed, 20 Dec 2023, 09:49 Sumit Desai, <sumit.de...@uplight.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Sofia,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for the response. For now, we have decided not to use a
>>>>>>>>>>> flex template. Is there a way to pass environment variables
>>>>>>>>>>> without using any template?
>>>>>>>>>>>
>>>>>>>>>>> Thanks & Regards,
>>>>>>>>>>> Sumit Desai
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Dec 20, 2023 at 3:16 PM Sofia’s World <
>>>>>>>>>>> mmistr...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi
>>>>>>>>>>>>  My 2 cents: have you ever considered using flex templates to
>>>>>>>>>>>> run your pipeline? Then you can pass all your parameters at
>>>>>>>>>>>> runtime.
>>>>>>>>>>>> (Apologies in advance if it does not cover your use case...)
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, 20 Dec 2023, 09:35 Sumit Desai via user, <
>>>>>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have a Python application which uses Apache Beam with Dataflow
>>>>>>>>>>>>> as the runner. The application uses a non-public Python package,
>>>>>>>>>>>>> 'uplight-telemetry', which is configured using 'extra_packages'
>>>>>>>>>>>>> while creating the pipeline_options object. This package expects
>>>>>>>>>>>>> an environment variable named 'OTEL_SERVICE_NAME', and since this
>>>>>>>>>>>>> variable is not present on the Dataflow worker, it results in an
>>>>>>>>>>>>> error during application startup.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I am passing this variable using custom pipeline options. The
>>>>>>>>>>>>> code that creates the pipeline options is as follows:
>>>>>>>>>>>>>
>>>>>>>>>>>>> pipeline_options = ProcessBillRequests.CustomOptions(
>>>>>>>>>>>>>     project=gcp_project_id,
>>>>>>>>>>>>>     region="us-east1",
>>>>>>>>>>>>>     job_name=job_name,
>>>>>>>>>>>>>     temp_location=f'gs://{TAS_GCS_BUCKET_NAME_PREFIX}{os.getenv("UP_PLATFORM_ENV")}/temp',
>>>>>>>>>>>>>     staging_location=f'gs://{TAS_GCS_BUCKET_NAME_PREFIX}{os.getenv("UP_PLATFORM_ENV")}/staging',
>>>>>>>>>>>>>     runner='DataflowRunner',
>>>>>>>>>>>>>     save_main_session=True,
>>>>>>>>>>>>>     service_account_email=service_account,
>>>>>>>>>>>>>     subnetwork=os.environ.get(SUBNETWORK_URL),
>>>>>>>>>>>>>     extra_packages=[uplight_telemetry_tar_file_path],
>>>>>>>>>>>>>     setup_file=setup_file_path,
>>>>>>>>>>>>>     OTEL_SERVICE_NAME=otel_service_name,
>>>>>>>>>>>>>     OTEL_RESOURCE_ATTRIBUTES=otel_resource_attributes
>>>>>>>>>>>>>     # Set values for additional custom variables as needed
>>>>>>>>>>>>> )
>>>>>>>>>>>>>
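>>>>>>>>>>>>> (For context, CustomOptions is a small PipelineOptions subclass
>>>>>>>>>>>>> along these lines; this is a rough reconstruction, since its
>>>>>>>>>>>>> definition is omitted here.)
>>>>>>>>>>>>>
>>>>>>>>>>>>> from apache_beam.options.pipeline_options import PipelineOptions
>>>>>>>>>>>>>
>>>>>>>>>>>>> class CustomOptions(PipelineOptions):
>>>>>>>>>>>>>     @classmethod
>>>>>>>>>>>>>     def _add_argparse_args(cls, parser):
>>>>>>>>>>>>>         # custom flags become attributes on the options object
>>>>>>>>>>>>>         parser.add_argument("--OTEL_SERVICE_NAME", default=None)
>>>>>>>>>>>>>         parser.add_argument("--OTEL_RESOURCE_ATTRIBUTES", default=None)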
>>>>>>>>>>>>>
>>>>>>>>>>>>> And the code that executes the pipeline is as follows:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> result = (
>>>>>>>>>>>>>         pipeline
>>>>>>>>>>>>>         | "ReadPendingRecordsFromDB" >> read_from_db
>>>>>>>>>>>>>         | "Parse input PCollection" >> beam.Map(ProcessBillRequests.parse_bill_data_requests)
>>>>>>>>>>>>>         | "Fetch bills " >> beam.ParDo(ProcessBillRequests.FetchBillInformation())
>>>>>>>>>>>>> )
>>>>>>>>>>>>>
>>>>>>>>>>>>> pipeline.run().wait_until_finish()
>>>>>>>>>>>>>
>>>>>>>>>>>>> Is there a way to make the environment variables I set in the
>>>>>>>>>>>>> custom options available on the worker?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks & Regards,
>>>>>>>>>>>>> Sumit Desai
>>>>>>>>>>>>>
>>>>>>>>>>>>
