Hi,

I think I finally managed to track down the difference: the Dataflow job runs
correctly when it has the pipeline option tempLocation set (in addition to
temp_location). I have been having trouble getting that field set via the
gcloud CLI, but using the python SDK
<https://cloud.google.com/dataflow/docs/reference/rpc/google.dataflow.v1beta3#google.dataflow.v1beta3.LaunchTemplateParameters>
it can be set with "environment": {"tempLocation": temp_location}, and the
job then launches and executes as expected. I'm still not sure why the two
behave differently.
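For reference, here is roughly how I'm launching it now. This is a minimal
sketch using the google-cloud-dataflow-client package; the project, bucket,
template path, and job name below are placeholders, not our real values:

    # Launch a Dataflow template with tempLocation set in the runtime
    # environment (sketch; all identifiers are placeholders).
    from google.cloud import dataflow_v1beta3

    def launch(project_id: str, gcs_template_path: str, temp_location: str):
        client = dataflow_v1beta3.TemplatesServiceClient()
        request = dataflow_v1beta3.LaunchTemplateRequest(
            project_id=project_id,
            gcs_path=gcs_template_path,
            launch_parameters=dataflow_v1beta3.LaunchTemplateParameters(
                job_name="example-templated-job",
                # Setting tempLocation here, in addition to the temp_location
                # pipeline option baked into the template, is what made the
                # job run to completion for us.
                environment=dataflow_v1beta3.RuntimeEnvironment(
                    temp_location=temp_location,
                ),
            ),
        )
        response = client.launch_template(request=request)
        print(response.job.id)

    launch("my-project", "gs://my-bucket/templates/my-template",
           "gs://my-bucket/temp")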
Thanks,
Patrick

On Wed, Jan 11, 2023 at 8:45 AM Patrick McQuighan <[email protected]> wrote:

> Hi Bruno,
>
> Thanks for the response. The SDK version and everything else should be
> identical: this issue occurs using code from the exact same commit in git,
> and the dependencies are frozen. I should mention this is using the python
> SDK version 2.39.0.
>
> Diffing the templates only appears to show expected differences, e.g. the
> tempStoragePrefix points to a different bucket, and serialized_fns and
> windowing_strategy differ in a couple of locations, but the difference is
> just the name of a temp directory (e.g. tmpv396o090). Similarly, I cannot
> see any differences in the pipeline options. So I've been scratching my
> head trying to figure out what's going on here :/.
>
> I'll create a ticket with Dataflow support and see if there's something
> with the Dataflow runner that might be causing this issue!
>
> -Patrick
>
> On Tue, Jan 10, 2023 at 8:32 PM Bruno Volpato <[email protected]> wrote:
>
>> Hi Patrick,
>>
>> I have a few questions that might help troubleshoot this:
>>
>> Did you use the same SDK? Have you updated Beam or any other
>> dependencies?
>> Are there any other error logs (prior to the trace above) that could
>> help understand it?
>> Do you still have the previous template so you can compare the contents?
>> (They are JSON, so formatting and diffing may be sufficient here.)
>> If not, I'd suggest comparing the "Job info" and "Pipeline options" for
>> possible environment/parameter changes.
>>
>> This might be related to a specific runner (Dataflow) rather than the
>> SDK, so if the above doesn't help, a good approach may be contacting
>> Dataflow support and providing specific job IDs so they can take a
>> closer look.
>>
>> Best,
>> Bruno
>>
>> On Tue, Jan 10, 2023 at 8:42 PM Patrick McQuighan via user
>> <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I recently started encountering a strange error where a Dataflow job
>>> launched from a template never completes, even though it runs fine when
>>> launched directly. The template has been in use since Dec 14 without
>>> issue, but recreating the template today (or at any point in the past
>>> week) and executing it results in one stage of the job sitting at 100%
>>> complete for hours and never completing.
>>>
>>> When running the job directly (i.e. not via the template) today, the
>>> Logs Explorer shows a confusing message, but the job does complete:
>>>
>>> Error requesting progress from SDK: OUT_OF_RANGE: SDK claims to be
>>> processing element 535 yet only 535 elements have been sent
>>>
>>> When running via the template, the following three errors show up:
>>>
>>> Element processed sanity check disabled due to SDK not reporting number
>>> of elements processed.
>>>
>>> Error requesting progress from SDK: UNKNOWN: Traceback (most recent call last):
>>>   File "/usr/local/lib/python3.9/site-packages/apache_beam/runners/worker/sdk_worker.py", line 667, in process_bundle_progress
>>>     processor = self.bundle_processor_cache.lookup(request.instruction_id)
>>>   File "/usr/local/lib/python3.9/site-packages/apache_beam/runners/worker/sdk_worker.py", line 468, in lookup
>>>     raise RuntimeError(
>>> RuntimeError: Bundle processing associated with
>>> process_bundle-7395200449888031466-19 has failed. Check prior failing
>>> response for details.
>>> [type.googleapis.com/util.MessageSetPayload='[dist_proc.dax.internal.TrailProto]
>>> { trail_point { source_file_loc { filepath:
>>> "dist_proc/dax/workflow/worker/fnapi_service_impl.cc" line: 800 } } }']
>>> === Source Location Trace: ===
>>> dist_proc/dax/workflow/worker/fnapi_sdk_harness.cc:183
>>> dist_proc/dax/workflow/worker/fnapi_service_impl.cc:800
>>>
>>> SDK failed progress reporting 6 times (limit: 5), no longer holding
>>> back progress to last SDK reported progress.
>>>
>>> None of these error messages show up with the template created on Dec
>>> 14, so I'm unsure whether some setting or default behavior has changed
>>> or what else is going on. Any help or pointers for debugging would be
>>> much appreciated.
>>>
>>> Thanks,
>>> Patrick
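P.S. For anyone who hits this later: the template comparison Bruno suggests
above is easy to script, since the templates are plain JSON. A minimal
sketch (the file names are placeholders):

    # Pretty-print two Dataflow template JSON files and show a unified diff.
    import difflib
    import json

    def diff_templates(path_a: str, path_b: str) -> str:
        def pretty(path: str) -> list:
            with open(path) as f:
                # sort_keys keeps the diff stable across key ordering
                return json.dumps(json.load(f), indent=2,
                                  sort_keys=True).splitlines()

        return "\n".join(difflib.unified_diff(
            pretty(path_a), pretty(path_b),
            fromfile=path_a, tofile=path_b, lineterm="",
        ))

    print(diff_templates("template_dec14.json", "template_today.json"))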
