Hi,

I think I finally managed to track down the difference - the Dataflow job
runs correctly when it has the pipeline option tempLocation set (in
addition to temp_location).  I have been having trouble getting that field
set via the gcloud CLI, but using the Python client with
LaunchTemplateParameters
<https://cloud.google.com/dataflow/docs/reference/rpc/google.dataflow.v1beta3#google.dataflow.v1beta3.LaunchTemplateParameters>
it can be set with "environment": {"tempLocation": temp_location}, and the
job then launches and executes as expected.  I'm still not sure why the
two behave differently.
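
In case it helps, this is roughly the launch call that works for me (a
minimal sketch; the project, region, bucket, and template paths are
placeholders):

    # pip install google-api-python-client
    # Uses Application Default Credentials for auth.
    from googleapiclient.discovery import build

    dataflow = build("dataflow", "v1b3")
    request = dataflow.projects().locations().templates().launch(
        projectId="my-project",      # placeholder
        location="us-central1",      # placeholder
        gcsPath="gs://my-bucket/templates/my-template",  # placeholder
        body={
            "jobName": "my-job",
            "parameters": {},  # template-specific parameters
            # Setting tempLocation here (not just temp_location) is what
            # made the job run to completion.
            "environment": {"tempLocation": "gs://my-bucket/temp"},
        },
    )
    response = request.execute()
    print(response["job"]["id"])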

Thanks,
Patrick

On Wed, Jan 11, 2023 at 8:45 AM Patrick McQuighan <[email protected]>
wrote:

> Hi Bruno,
> Thanks for the response.  The SDK version and dependencies should be
> identical - this issue occurs using code from the exact same commit in
> git, with frozen dependencies.  I should mention this is using the Python
> SDK version 2.39.0.
>
> Diffing the templates only shows expected differences: the
> tempStoragePrefix points to a different bucket, and serialized_fns and
> windowing_strategy differ in a couple of locations, but only in the name
> of a temp directory (e.g. tmpv396o090).  Similarly, I can't see any
> differences in the pipeline options.  So I've been scratching my head
> trying to figure out what's going on here :/.
>
> I'll create a ticket with Dataflow support and see if there's something
> with the Dataflow runner that might be causing this issue!
>
> -Patrick
>
>
>
> On Tue, Jan 10, 2023 at 8:32 PM Bruno Volpato <[email protected]> wrote:
>
>> Hi Patrick,
>>
>> I have a few questions that might help troubleshoot this:
>>
>> Did you use the same SDK? Have you updated Beam or any other dependencies?
>> Are there any other error logs (prior to the trace above) that could help
>> explain it?
>> Do you still have the previous template so you can compare the contents?
>> (They are JSON, so formatting and diffing may be sufficient here - see the
>> sketch below.)
>> If not, I'd suggest comparing the "Job info" and "Pipeline options" for
>> possible environment/parameter changes.
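>>
>> Something like this should work for the diff (a rough sketch; the file
>> paths are placeholders):
>>
>>     import difflib
>>     import json
>>
>>     def load_pretty(path):
>>         # Sort keys so the two templates line up regardless of field order.
>>         with open(path) as f:
>>             return json.dumps(json.load(f), indent=2, sort_keys=True).splitlines()
>>
>>     old = load_pretty("template_dec14.json")  # placeholder paths
>>     new = load_pretty("template_today.json")
>>     print("\n".join(difflib.unified_diff(old, new, "dec14", "today", lineterm="")))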
>>
>> This might be related to a specific runner (Dataflow) rather than the
>> SDK, so if the above doesn't help, a good approach may be contacting
>> Dataflow support and providing specific job IDs so they can take a closer
>> look.
>>
>> Best,
>> Bruno
>>
>>
>>
>> On Tue, Jan 10, 2023 at 8:42 PM Patrick McQuighan via user <
>> [email protected]> wrote:
>>
>>> Hi,
>>>
>>> I recently started encountering a strange error where a Dataflow job
>>> launched from a template never completes, but runs fine when launched
>>> directly.  The template has been in use since Dec 14 without issue, but
>>> recreating the template today (or any time in the past week) and executing
>>> it results in one stage of the job sitting at 100% complete for hours and
>>> never completing.
>>>
>>> When running the job directly (i.e. not via template) today, it does
>>> complete, but the Logs Explorer shows a confusing message:
>>> Error requesting progress from SDK: OUT_OF_RANGE: SDK claims to be
>>> processing element 535 yet only 535 elements have been sent
>>>
>>> When trying to run via template, the following three errors show up:
>>>
>>> Element processed sanity check disabled due to SDK not reporting number
>>> of elements processed.
>>>
>>> Error requesting progress from SDK: UNKNOWN: Traceback (most recent call
>>> last):
>>>   File
>>> "/usr/local/lib/python3.9/site-packages/apache_beam/runners/worker/sdk_worker.py",
>>> line 667, in process_bundle_progress
>>>     processor =
>>> self.bundle_processor_cache.lookup(request.instruction_id)
>>>   File
>>> "/usr/local/lib/python3.9/site-packages/apache_beam/runners/worker/sdk_worker.py",
>>> line 468, in lookup
>>>     raise RuntimeError(
>>> RuntimeError: Bundle processing associated with
>>> process_bundle-7395200449888031466-19 has failed. Check prior failing
>>> response for details.
>>>  [
>>> type.googleapis.com/util.MessageSetPayload='[dist_proc.dax.internal.TrailProto]
>>> { trail_point { source_file_loc { filepath:
>>> "dist_proc/dax/workflow/worker/fnapi_service_impl.cc" line: 800 } } }']
>>> === Source Location Trace: ===
>>> dist_proc/dax/workflow/worker/fnapi_sdk_harness.cc:183
>>> dist_proc/dax/workflow/worker/fnapi_service_impl.cc:800
>>>
>>> SDK failed progress reporting 6 times (limit: 5), no longer holding back
>>> progress to last SDK reported progress.
>>>
>>> None of these error messages show up when running the template created on
>>> Dec 14, so I'm unsure whether some setting or default behavior has changed.
>>> Any help or pointers for debugging would be much appreciated.
>>>
>>> Thanks,
>>> Patrick
>>>
>>
