Hi Bruno,
Thanks for the response.  The SDK version and all dependencies should be
identical: this issue occurs using code from the exact same commit in git,
and the dependencies are frozen.  I should also mention this is using the
Python SDK, version 2.39.0.
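
(For what it's worth, I sanity-checked the SDK version on both sides with a
quick check like the one below; both environments report 2.39.0.)

    import apache_beam

    # Run in both the environment that builds the template and the one
    # that launches the job directly; both report the same version.
    print(apache_beam.__version__)  # -> 2.39.0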

Diffing the templates only appears to show expected differences: e.g. the
tempStoragePrefix points to a different bucket, and the serialized_fn and
windowing_strategy fields differ in a couple of locations, but only in the
name of a temp directory (e.g. tmpv396o090).  Similarly, I can't see any
differences in the Pipeline options.  So I've been scratching my head
trying to figure out what's going on here :/.
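
In case it's useful, this is roughly the script I'm using to diff the two
templates (the filenames below are just placeholders for the Dec 14
template and today's template):

    import difflib
    import json
    import re

    def normalize(path):
        # Pretty-print with sorted keys so the diff is stable, and mask
        # the random temp-directory names (e.g. tmpv396o090) so they
        # don't show up as spurious differences.
        with open(path) as f:
            text = json.dumps(json.load(f), indent=2, sort_keys=True)
        return re.sub(r"tmp[0-9a-z]{8}", "tmpXXXXXXXX", text).splitlines()

    for line in difflib.unified_diff(
            normalize("template_dec14.json"), normalize("template_today.json"),
            fromfile="dec14", tofile="today", lineterm=""):
        print(line)

This only surfaces the bucket and temp-directory differences mentioned
above, which is why I'm stuck.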

I'll create a ticket with Dataflow support and see if there's something
with the Dataflow runner that might be causing this issue!

-Patrick



On Tue, Jan 10, 2023 at 8:32 PM Bruno Volpato <[email protected]> wrote:

> Hi Patrick,
>
> I have a few questions that might help troubleshoot this:
>
> Did you use the same SDK? Have you updated Beam or any other dependencies?
> Are there any other error logs (prior to the trace above) that could help
> us understand it?
> Do you still have the previous template so you can compare the contents?
> (They are JSON, so formatting and diffing may be sufficient here.)
> If not, I'd suggest comparing the "Job info" and "Pipeline options" for
> possible environment/parameter changes.
>
> This might be related to a specific runner (Dataflow) rather than the SDK,
> so if the above doesn't help, a good approach may be contacting Dataflow
> support and providing specific job IDs so they can take a closer look.
>
> Best,
> Bruno
>
>
>
> On Tue, Jan 10, 2023 at 8:42 PM Patrick McQuighan via user <
> [email protected]> wrote:
>
>> Hi,
>>
>> I recently started encountering a strange error where a Dataflow job
>> launched from a template never completes, but runs fine when launched
>> directly. The template has been in use since Dec 14 without issue, but
>> recreating the template today (or at any point in the past week) and
>> executing it results in one stage of the job sitting at 100% complete
>> for hours and never completing.
>>
>> When running the job directly (i.e. not via a template) today, the job
>> does complete, but Logs Explorer shows a confusing message:
>> Error requesting progress from SDK: OUT_OF_RANGE: SDK claims to be
>> processing element 535 yet only 535 elements have been sent
>>
>> When running via the template, the following three errors show up:
>>
>> Element processed sanity check disabled due to SDK not reporting number
>> of elements processed.
>>
>> Error requesting progress from SDK: UNKNOWN: Traceback (most recent call
>> last):
>>   File
>> "/usr/local/lib/python3.9/site-packages/apache_beam/runners/worker/sdk_worker.py",
>> line 667, in process_bundle_progress
>>     processor = self.bundle_processor_cache.lookup(request.instruction_id)
>>   File
>> "/usr/local/lib/python3.9/site-packages/apache_beam/runners/worker/sdk_worker.py",
>> line 468, in lookup
>>     raise RuntimeError(
>> RuntimeError: Bundle processing associated with
>> process_bundle-7395200449888031466-19 has failed. Check prior failing
>> response for details.
>>  [
>> type.googleapis.com/util.MessageSetPayload='[dist_proc.dax.internal.TrailProto]
>> { trail_point { source_file_loc { filepath:
>> "dist_proc/dax/workflow/worker/fnapi_service_impl.cc" line: 800 } } }']
>> === Source Location Trace: ===
>> dist_proc/dax/workflow/worker/fnapi_sdk_harness.cc:183
>> dist_proc/dax/workflow/worker/fnapi_service_impl.cc:800
>>
>> SDK failed progress reporting 6 times (limit: 5), no longer holding back
>> progress to last SDK reported progress.
>>
>> None of these error messages show up with the template created on Dec 14,
>> so I'm unsure whether some setting or default behavior has changed, or
>> what's going on. Any help or pointers for debugging would be much
>> appreciated.
>>
>> Thanks,
>> Patrick
>>
>