Hi Patrick, I have a few questions that might help troubleshoot this:

- Did you use the same SDK?
- Have you updated Beam or any other dependencies?
- Are there any other error logs (prior to the trace above) that could help
  explain it?
- Do you still have the previous template so you can compare the contents?
  (Templates are JSON, so formatting and diffing them may be sufficient here.)
  If not, I'd suggest comparing the "Job info" and "Pipeline options" of the
  two jobs for possible environment/parameter changes.

This might be related to the specific runner (Dataflow) rather than the SDK,
so if the above doesn't help, a good approach may be to contact Dataflow
support and provide specific job IDs so they can take a closer look.

Best,
Bruno

On Tue, Jan 10, 2023 at 8:42 PM Patrick McQuighan via user <[email protected]> wrote:

> [email protected]
>
> Hi,
>
> I recently started encountering a strange error where a Dataflow job
> launched from a template never completes, but runs when launched directly.
> The template has been in use since Dec 14 without issue, but trying to
> recreate the template today (or the past week) and executing it, results in
> one stage of the job sitting at 100% complete for hours, and never
> completing.
>
> When trying to run the job directly (i.e. not via template) today, the
> Logs Explorer has a confusing message, but does complete:
>
> Error requesting progress from SDK: OUT_OF_RANGE: SDK claims to be
> processing element 535 yet only 535 elements have been sent
>
> When trying to run via template, the following three errors show up:
>
> Element processed sanity check disabled due to SDK not reporting number of
> elements processed.
>
> Error requesting progress from SDK: UNKNOWN: Traceback (most recent call last):
>   File "/usr/local/lib/python3.9/site-packages/apache_beam/runners/worker/sdk_worker.py", line 667, in process_bundle_progress
>     processor = self.bundle_processor_cache.lookup(request.instruction_id)
>   File "/usr/local/lib/python3.9/site-packages/apache_beam/runners/worker/sdk_worker.py", line 468, in lookup
>     raise RuntimeError(
> RuntimeError: Bundle processing associated with
> process_bundle-7395200449888031466-19 has failed. Check prior failing
> response for details.
> [type.googleapis.com/util.MessageSetPayload='[dist_proc.dax.internal.TrailProto]
> { trail_point { source_file_loc { filepath:
> "dist_proc/dax/workflow/worker/fnapi_service_impl.cc" line: 800 } } }']
> === Source Location Trace: ===
> dist_proc/dax/workflow/worker/fnapi_sdk_harness.cc:183
> dist_proc/dax/workflow/worker/fnapi_service_impl.cc:800
>
> SDK failed progress reporting 6 times (limit: 5), no longer holding back
> progress to last SDK reported progress.
>
> None of these error messages show up in the template created on Dec 14, so
> I'm unsure if some setting or default behavior has been changed or what's
> going on. Any help or pointers to debug would be much appreciated.
>
> Thanks,
> Patrick
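
For the template comparison suggested in the reply, a minimal Python sketch might look like the following. It only assumes that both templates are valid JSON. The function names and the sample documents are made up for illustration and do not reflect the actual structure of a Dataflow template.

```python
import difflib
import json


def normalize(template_text: str) -> list[str]:
    """Parse JSON and re-serialize it with sorted keys and fixed indentation,
    so that key order and whitespace do not show up as differences."""
    parsed = json.loads(template_text)
    return json.dumps(parsed, indent=2, sort_keys=True).splitlines()


def diff_templates(old_text: str, new_text: str) -> list[str]:
    """Return unified-diff lines between two normalized JSON documents.
    The list is empty when the documents are equivalent."""
    return list(
        difflib.unified_diff(
            normalize(old_text),
            normalize(new_text),
            fromfile="old_template",
            tofile="new_template",
            lineterm="",
        )
    )


# Made-up sample documents, NOT real template contents.
old_sample = '{"options": {"example_flag": "a"}, "name": "job"}'
new_sample = '{"name": "job", "options": {"example_flag": "b"}}'

for line in diff_templates(old_sample, new_sample):
    print(line)
```

To compare real templates, read the two files (for example, copies of the Dec 14 template and the newly created one downloaded from their storage location) and pass their contents to `diff_templates`. Sorting keys first keeps the diff focused on changed values rather than on reordered fields.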
