Hi Bruno,

Thanks for the response. The SDK version and everything else should be identical - this issue occurs using code from the exact same commit in git, and the dependencies are frozen. I should mention this is using the Python SDK version 2.39.0.

Diffing between the templates only appears to show expected differences: e.g. the tempStoragePrefix points to a different bucket, and serialized_fn and windowing_strategy differ in a couple of locations, but only in the name of a temp directory (e.g. tmpv396o090). Similarly, I cannot see any differences in the Pipeline options.
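For anyone who wants to reproduce the comparison, a normalize-then-diff script along these lines is roughly what I mean (a minimal sketch - the file names are placeholders, and the first regex just assumes Python tempfile-style directory names):

import difflib
import json
import re

# Placeholder file names - local copies of the two generated templates.
OLD, NEW = "template_dec14.json", "template_jan10.json"

# Values expected to differ between otherwise-identical templates:
# tempfile-style temp directory names (e.g. tmpv396o090) and GCS paths
# such as tempStoragePrefix.
VOLATILE = [
    (re.compile(r"tmp[a-z0-9_]{8}"), "tmp<scrubbed>"),
    (re.compile(r"gs://[^\"\s]+"), "gs://<scrubbed>"),
]

def normalized(path):
    # Re-serialize with sorted keys so key ordering never shows up as a diff,
    # then scrub the values we expect to vary from run to run.
    with open(path) as f:
        text = json.dumps(json.load(f), indent=2, sort_keys=True)
    for pattern, placeholder in VOLATILE:
        text = pattern.sub(placeholder, text)
    return text.splitlines(keepends=True)

for line in difflib.unified_diff(normalized(OLD), normalized(NEW),
                                 fromfile=OLD, tofile=NEW):
    print(line, end="")

With that scrubbing applied, the two templates come out effectively identical.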
So I've been scratching my head trying to figure out what's going on here :/. I'll create a ticket with Dataflow support and see if there's something with the Dataflow runner that might be causing this issue!

-Patrick

On Tue, Jan 10, 2023 at 8:32 PM Bruno Volpato <[email protected]> wrote:

> Hi Patrick,
>
> I have a few questions that might help troubleshoot this:
>
> Did you use the same SDK? Have you updated Beam or any other dependencies?
> Are there any other error logs (prior to the trace above) that could help
> understand it?
> Do you still have the previous template so you can compare the contents?
> (They are JSON, so formatting and diffing may be sufficient here.)
> If not, I'd suggest comparing the "Job info" and "Pipeline options" for
> possible environment/parameter changes.
>
> This might be related to a specific runner (Dataflow) rather than the SDK,
> so if the above doesn't help, a good approach may be contacting Dataflow
> support and providing specific job IDs so they can take a closer look.
>
> Best,
> Bruno
>
>
> On Tue, Jan 10, 2023 at 8:42 PM Patrick McQuighan via user <
> [email protected]> wrote:
>
>> Hi,
>>
>> I recently started encountering a strange error where a Dataflow job
>> launched from a template never completes, but runs fine when launched
>> directly. The template had been in use since Dec 14 without issue, but
>> recreating it today (or at any point in the past week) and executing it
>> results in one stage of the job sitting at 100% complete for hours and
>> never completing.
>>
>> When running the job directly (i.e. not via template) today, the job
>> does complete, though the Logs Explorer shows a confusing message:
>>
>> Error requesting progress from SDK: OUT_OF_RANGE: SDK claims to be
>> processing element 535 yet only 535 elements have been sent
>>
>> When running via template, the following three errors show up:
>>
>> Element processed sanity check disabled due to SDK not reporting number
>> of elements processed.
>>
>> Error requesting progress from SDK: UNKNOWN: Traceback (most recent call last):
>>   File "/usr/local/lib/python3.9/site-packages/apache_beam/runners/worker/sdk_worker.py", line 667, in process_bundle_progress
>>     processor = self.bundle_processor_cache.lookup(request.instruction_id)
>>   File "/usr/local/lib/python3.9/site-packages/apache_beam/runners/worker/sdk_worker.py", line 468, in lookup
>>     raise RuntimeError(
>> RuntimeError: Bundle processing associated with
>> process_bundle-7395200449888031466-19 has failed. Check prior failing
>> response for details.
>> [ type.googleapis.com/util.MessageSetPayload='[dist_proc.dax.internal.TrailProto]
>> { trail_point { source_file_loc { filepath:
>> "dist_proc/dax/workflow/worker/fnapi_service_impl.cc" line: 800 } } }']
>> === Source Location Trace: ===
>> dist_proc/dax/workflow/worker/fnapi_sdk_harness.cc:183
>> dist_proc/dax/workflow/worker/fnapi_service_impl.cc:800
>>
>> SDK failed progress reporting 6 times (limit: 5), no longer holding back
>> progress to last SDK reported progress.
>>
>> None of these error messages show up with the template created on Dec 14,
>> so I'm unsure if some setting or default behavior has changed, or what's
>> going on. Any help or pointers to debug would be much appreciated.
>>
>> Thanks,
>> Patrick
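P.S. For anyone else comparing runs: the "Pipeline options" can also be pulled programmatically from the Dataflow v1b3 API and diffed across two job IDs. A rough sketch (assumes google-api-python-client; the project, region, and job IDs are placeholders):

from googleapiclient.discovery import build  # pip install google-api-python-client

# Placeholders - substitute real values. Auth comes from application
# default credentials (gcloud auth application-default login).
PROJECT, REGION = "my-project", "us-central1"
OLD_JOB, NEW_JOB = "old-job-id", "new-job-id"

dataflow = build("dataflow", "v1b3")

def pipeline_options(job_id):
    # JOB_VIEW_ALL includes the environment, which carries the SDK
    # pipeline options the job was launched with.
    job = dataflow.projects().locations().jobs().get(
        projectId=PROJECT, location=REGION, jobId=job_id,
        view="JOB_VIEW_ALL").execute()
    return job.get("environment", {}).get("sdkPipelineOptions", {})

old, new = pipeline_options(OLD_JOB), pipeline_options(NEW_JOB)
for key in sorted(set(old) | set(new)):
    if old.get(key) != new.get(key):
        print(f"{key}:\n  old: {old.get(key)}\n  new: {new.get(key)}")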
