Thanks, XQ and Evan. I am going to try out your suggestions.

Regards,
Sumit Desai
On Sat, Dec 23, 2023 at 12:16 AM Evan Galpin <egal...@apache.org> wrote:

> I assume from the previous messages that GCP Dataflow is being used as the
> pipeline runner. Even without Flex Templates, the v2 runner can use Docker
> containers to install all dependencies from various sources[1]. I have
> used Docker containers to solve the same problem you mention: installing a
> Python dependency from a private package repository. The process is
> roughly:
>
> 1. Build a Docker container from the Apache Beam base images, customizing
>    it as you need[2]
> 2. Tag and push that image to Google Container Registry
> 3. When you deploy your Dataflow job, include the options
>    "--experiment=use_runner_v2
>    --worker_harness_container_image=gcr.io/my-project/my-image-name:my-image-tag"
>    (there may be other ways, but this is what I have seen working
>    first-hand)
>
> Your Dockerfile can be as simple as:
>
> # The python:<major>.<minor>-slim base image must match the
> # apache/beam_python<major>.<minor>_sdk image copied from below.
> FROM python:3.10-slim
>
> # Authenticate with the private Python package repo, install all
> # dependencies, set env vars, COPY your pipeline code into the container,
> # etc.
> # ...
>
> # Copy files from the official SDK image, including the launcher script
> # and its dependencies. The Beam SDK's Python version must match the base
> # image's major.minor version. Based on
> # https://cloud.google.com/dataflow/docs/guides/using-custom-containers#python_1
> COPY --from=apache/beam_python3.10_sdk:2.52.0 /opt/apache/beam /opt/apache/beam
>
> # Set the entrypoint to the Apache Beam SDK launcher.
> ENTRYPOINT ["/opt/apache/beam/boot"]
>
> [1] https://cloud.google.com/dataflow/docs/guides/using-custom-containers#python_1
> [2] https://cloud.google.com/dataflow/docs/guides/build-container-image#python
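>
> As a rough, untested sketch, the same deploy-time options expressed in
> Python rather than CLI flags (the project, bucket, and image names are
> placeholders) would look something like:
>
> from apache_beam.options.pipeline_options import PipelineOptions
>
> # Mirrors the flags above: runner v2 plus a custom worker image.
> options = PipelineOptions(
>     runner="DataflowRunner",
>     project="my-project",                 # placeholder
>     region="us-east1",
>     temp_location="gs://my-bucket/temp",  # placeholder
>     experiments=["use_runner_v2"],
>     worker_harness_container_image="gcr.io/my-project/my-image-name:my-image-tag",
> )
>
> Since the image is fully under your control, OTEL_SERVICE_NAME could also
> be baked in with an ENV instruction in the Dockerfile above.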
> On Fri, Dec 22, 2023 at 6:32 AM XQ Hu via user <user@beam.apache.org> wrote:
>
>> You can use the same Docker image for both the template launcher and the
>> Dataflow job. Here is one example:
>> https://github.com/google/dataflow-ml-starter/blob/main/tensorflow_gpu.flex.Dockerfile#L60
>>
>> On Fri, Dec 22, 2023 at 8:04 AM Sumit Desai <sumit.de...@uplight.com> wrote:
>>
>>> Yes, I will have to try it out.
>>>
>>> Regards,
>>> Sumit Desai
>>>
>>> On Fri, Dec 22, 2023 at 3:53 PM Sofia’s World <mmistr...@gmail.com> wrote:
>>>
>>>> I guess so; I am not an expert on using env variables in Dataflow
>>>> pipelines, as I pass any config dependencies I need as job input
>>>> params.
>>>>
>>>> But perhaps you can configure the variables in your Dockerfile (I am
>>>> not an expert in this either), since flex templates use Docker?
>>>>
>>>> https://cloud.google.com/dataflow/docs/guides/templates/configuring-flex-templates
>>>>
>>>> hth
>>>> Marco
>>>>
>>>> On Fri, Dec 22, 2023 at 10:17 AM Sumit Desai <sumit.de...@uplight.com> wrote:
>>>>
>>>>> We are using an external, non-public package which expects
>>>>> environment variables only. If the environment variables are not
>>>>> found, it throws an error. We can't change the source of this
>>>>> package.
>>>>>
>>>>> Does this mean we will face the same problem with flex templates
>>>>> also?
>>>>>
>>>>> On Fri, 22 Dec 2023, 3:39 pm Sofia’s World, <mmistr...@gmail.com> wrote:
>>>>>
>>>>>> The flex template will allow you to pass input params with dynamic
>>>>>> values to your Dataflow job, so you could replace the env variable
>>>>>> with that input? That is, unless you have to have env vars... but
>>>>>> from your snippets it appears you are just using them to configure
>>>>>> one of your components?
>>>>>>
>>>>>> Hth
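>>>>>>
>>>>>> As a rough, untested sketch (the TelemetryOptions class and the
>>>>>> flag names are made up for illustration), such an input param could
>>>>>> be declared like this:
>>>>>>
>>>>>> from apache_beam.options.pipeline_options import PipelineOptions
>>>>>>
>>>>>> # Hypothetical options class: turns the telemetry settings into
>>>>>> # first-class job parameters instead of environment variables.
>>>>>> class TelemetryOptions(PipelineOptions):
>>>>>>     @classmethod
>>>>>>     def _add_argparse_args(cls, parser):
>>>>>>         parser.add_argument("--otel_service_name", default="")
>>>>>>         parser.add_argument("--otel_resource_attributes", default="")
>>>>>>
>>>>>> The values are then readable anywhere the options object is
>>>>>> available, via
>>>>>> pipeline_options.view_as(TelemetryOptions).otel_service_name.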
>>>>>> On Fri, 22 Dec 2023, 10:01 Sumit Desai, <sumit.de...@uplight.com> wrote:
>>>>>>
>>>>>>> Hi Sofia and XQ,
>>>>>>>
>>>>>>> The application is failing because I have loggers defined in every
>>>>>>> file, and the method that creates a logger tries to create an
>>>>>>> object of UplightTelemetry. If I use flex templates, will the
>>>>>>> environment variables I supply be loaded before the application
>>>>>>> gets loaded? If not, it would not serve my purpose.
>>>>>>>
>>>>>>> Thanks & Regards,
>>>>>>> Sumit Desai
>>>>>>>
>>>>>>> On Thu, Dec 21, 2023 at 10:02 AM Sumit Desai <sumit.de...@uplight.com> wrote:
>>>>>>>
>>>>>>>> Thank you XQ. Will take a look at this.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Sumit Desai
>>>>>>>>
>>>>>>>> On Wed, Dec 20, 2023 at 8:13 PM XQ Hu <x...@google.com> wrote:
>>>>>>>>
>>>>>>>>> Dataflow VMs cannot know your local env variables. I think you
>>>>>>>>> should use a custom container:
>>>>>>>>> https://cloud.google.com/dataflow/docs/guides/using-custom-containers.
>>>>>>>>> Here is a sample project:
>>>>>>>>> https://github.com/google/dataflow-ml-starter
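>>>>>>>>>
>>>>>>>>> If a custom container is not an option, one rough, untested
>>>>>>>>> workaround sketch (EnvSettingDoFn is a hypothetical name) is to
>>>>>>>>> export the variable on each worker from a regular pipeline option
>>>>>>>>> before the package reads it. Note this only helps if the package
>>>>>>>>> reads the variable lazily rather than at import time:
>>>>>>>>>
>>>>>>>>> import os
>>>>>>>>>
>>>>>>>>> import apache_beam as beam
>>>>>>>>>
>>>>>>>>> class EnvSettingDoFn(beam.DoFn):
>>>>>>>>>     def __init__(self, otel_service_name):
>>>>>>>>>         self._otel_service_name = otel_service_name
>>>>>>>>>
>>>>>>>>>     def setup(self):
>>>>>>>>>         # Runs once per worker process, before any elements are
>>>>>>>>>         # processed, so the variable is in place for later reads.
>>>>>>>>>         os.environ["OTEL_SERVICE_NAME"] = self._otel_service_name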
>>>>>>>>> On Wed, Dec 20, 2023 at 4:57 AM Sofia’s World <mmistr...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hello Sumit,
>>>>>>>>>> Thanks. Sorry... I guess if the value of the env variable is
>>>>>>>>>> always the same, you can pass it as a job param? Though it
>>>>>>>>>> doesn't sound like a viable option...
>>>>>>>>>> Hth
>>>>>>>>>>
>>>>>>>>>> On Wed, 20 Dec 2023, 09:49 Sumit Desai, <sumit.de...@uplight.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Sofia,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for the response. For now, we have decided not to use a
>>>>>>>>>>> flex template. Is there a way to pass environment variables
>>>>>>>>>>> without using any template?
>>>>>>>>>>>
>>>>>>>>>>> Thanks & Regards,
>>>>>>>>>>> Sumit Desai
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Dec 20, 2023 at 3:16 PM Sofia’s World <mmistr...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>> My 2 cents: have you ever considered using flex templates to
>>>>>>>>>>>> run your pipeline? Then you can pass all your parameters at
>>>>>>>>>>>> runtime. (Apologies in advance if it does not cover your use
>>>>>>>>>>>> case...)
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, 20 Dec 2023, 09:35 Sumit Desai via user, <user@beam.apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have a Python application which is using Apache Beam with
>>>>>>>>>>>>> Dataflow as the runner. The application uses a non-public
>>>>>>>>>>>>> Python package 'uplight-telemetry', which is configured using
>>>>>>>>>>>>> 'extra_packages' while creating the pipeline_options object.
>>>>>>>>>>>>> This package expects an environment variable named
>>>>>>>>>>>>> 'OTEL_SERVICE_NAME', and since this variable is not present
>>>>>>>>>>>>> on the Dataflow workers, it results in an error during
>>>>>>>>>>>>> application startup.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I am passing this variable using custom pipeline options. The
>>>>>>>>>>>>> code to create the pipeline options is as follows:
>>>>>>>>>>>>>
>>>>>>>>>>>>> pipeline_options = ProcessBillRequests.CustomOptions(
>>>>>>>>>>>>>     project=gcp_project_id,
>>>>>>>>>>>>>     region="us-east1",
>>>>>>>>>>>>>     job_name=job_name,
>>>>>>>>>>>>>     temp_location=f'gs://{TAS_GCS_BUCKET_NAME_PREFIX}{os.getenv("UP_PLATFORM_ENV")}/temp',
>>>>>>>>>>>>>     staging_location=f'gs://{TAS_GCS_BUCKET_NAME_PREFIX}{os.getenv("UP_PLATFORM_ENV")}/staging',
>>>>>>>>>>>>>     runner='DataflowRunner',
>>>>>>>>>>>>>     save_main_session=True,
>>>>>>>>>>>>>     service_account_email=service_account,
>>>>>>>>>>>>>     subnetwork=os.environ.get(SUBNETWORK_URL),
>>>>>>>>>>>>>     extra_packages=[uplight_telemetry_tar_file_path],
>>>>>>>>>>>>>     setup_file=setup_file_path,
>>>>>>>>>>>>>     OTEL_SERVICE_NAME=otel_service_name,
>>>>>>>>>>>>>     OTEL_RESOURCE_ATTRIBUTES=otel_resource_attributes,
>>>>>>>>>>>>>     # Set values for additional custom variables as needed
>>>>>>>>>>>>> )
>>>>>>>>>>>>>
>>>>>>>>>>>>> And the code that executes the pipeline is as follows:
>>>>>>>>>>>>>
>>>>>>>>>>>>> result = (
>>>>>>>>>>>>>     pipeline
>>>>>>>>>>>>>     | "ReadPendingRecordsFromDB" >> read_from_db
>>>>>>>>>>>>>     | "Parse input PCollection" >> beam.Map(ProcessBillRequests.parse_bill_data_requests)
>>>>>>>>>>>>>     | "Fetch bills" >> beam.ParDo(ProcessBillRequests.FetchBillInformation())
>>>>>>>>>>>>> )
>>>>>>>>>>>>>
>>>>>>>>>>>>> pipeline.run().wait_until_finish()
>>>>>>>>>>>>>
>>>>>>>>>>>>> Is there a way I can make the environment variables set in
>>>>>>>>>>>>> custom options available on the worker?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks & Regards,
>>>>>>>>>>>>> Sumit Desai