You might be running into https://github.com/apache/beam/issues/30867.

Among the error messages you mentioned, the following one is closest to the
root cause: ```Error message from worker: generic::internal: Error
encountered with the status channel: There are 10 consecutive failures
obtaining SDK worker status info from sdk-0-0. The last success response
was received 3h20m2.648304212s ago at 2024-04-23T11:48:35.493682768+00:00.
SDK worker appears to be permanently unresponsive. Aborting the SDK. For
more information, see:
https://cloud.google.com/dataflow/docs/guides/common-errors#worker-lost-contact```

If the mitigations in https://github.com/apache/beam/issues/30867 don't resolve
your issue, please see
https://cloud.google.com/dataflow/docs/guides/common-errors#worker-lost-contact
for instructions on how to find out what is causing the workers to get stuck.
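While you are digging through the worker logs, one trick that can help show
where a stuck Python SDK worker is spending its time is to enable the
standard-library faulthandler from your own pipeline code. The sketch below is
only an illustration of that idea (the DoFn name and the 5-minute interval are
arbitrary choices, not anything Dataflow requires): it logs Python tracebacks
on a fatal signal such as SIGSEGV and periodically dumps the stacks of all
threads to stderr, which ends up in the worker logs.

```
import faulthandler
import sys

import apache_beam as beam


class StackDumpingDoFn(beam.DoFn):
    # Illustrative DoFn (hypothetical name) that turns faulthandler on inside
    # each SDK worker process.

    def setup(self):
        # Log Python tracebacks if the process receives a fatal signal
        # (e.g. SIGSEGV), which matches the segmentation faults you reported.
        faulthandler.enable(file=sys.stderr, all_threads=True)
        # Also dump the stacks of all threads every 5 minutes, so the worker
        # logs show where the harness is blocked if it stops making progress.
        faulthandler.dump_traceback_later(300, repeat=True, file=sys.stderr)

    def process(self, element):
        yield element
```

Calling the same two faulthandler functions from the setup() of one of your
existing DoFns works just as well; the point is only that they run inside the
SDK harness process that becomes unresponsive.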

Thanks!

On Tue, Apr 23, 2024 at 12:17 PM Wiśniowski Piotr <
contact.wisniowskipi...@gmail.com> wrote:

> Hi,
> We are investigating an issue with our Python SDK streaming pipelines and
> have a few questions, but first some context.
> Our stack:
> - Python SDK 2.54.0, but we also tried 2.55.1
> - Dataflow Streaming Engine with the SDK in a container image (we also
> tried Prime)
> - Currently our pipelines have low enough traffic that a single node
> handles it most of the time, but occasionally we do scale up.
> - Deployment via the Terraform `google_dataflow_flex_template_job` resource,
> which normally performs a job update when re-applying Terraform.
> - We make heavy use of `ReadModifyWriteStateSpec`, other states, and
> watermark timers, but we keep the size of the state under control.
> - We use custom coders (Pydantic Avro).
> The issue:
> - Occasionally watermark progression stops. The issue is not
> deterministic and happens roughly 1-2 times per day across a few pipelines.
> - No user code errors are reported, but we do get errors like these:
> ```INTERNAL: The work item requesting state read is no longer valid on the
> backend. The work has already completed or will be retried. This is
> expected during autoscaling events. [
> type.googleapis.com/util.MessageSetPayload='[dist_proc.dax.internal.TrailProto]
> { trail_point { source_file_loc { filepath:
> "dist_proc/windmill/client/streaming_rpc_client.cc" line: 767 } } }']```
> ```ABORTED: SDK harness sdk-0-0 disconnected. This usually means that the
> process running the pipeline code has crashed. Inspect the Worker Logs and
> the Diagnostics tab to determine the cause of the crash. [
> type.googleapis.com/util.MessageSetPayload='[dist_proc.dax.internal.TrailProto]
> { trail_point { source_file_loc { filepath:
> "dist_proc/dax/workflow/worker/fnapi_control_service.cc" line: 217 } } }
> [dist_proc.dax.MessageCode] { origin_id: 5391582787251181999
> [dist_proc.dax.workflow.workflow_io_message_ext]: SDK_DISCONNECT }']```
> ```Work item for sharding key 8dd4578b4f280f5d tokens
> (1316764909133315359, 17766288489530478880) encountered error during
> processing, will be retried (possibly on another worker):
> generic::internal: Error encountered with the status channel: SDK harness
> sdk-0-0 disconnected. with MessageCode: (93f1db2f7a4a325c): SDK
> disconnect.```
> ```Python (worker sdk-0-0_sibling_1) exited 1 times: signal: segmentation
> fault (core dumped) restarting SDK process```
> - We did manage to correlate this with either a vertical autoscaling event
> (when using Prime) or other worker replacements done by Dataflow under the
> hood, but the correlation is not deterministic.
> - For a few hours watermark progress stops, while other workers continue
> to process messages.
> - And after a few hours:
> ```Error message from worker: generic::internal: Error encountered with
> the status channel: There are 10 consecutive failures obtaining SDK worker
> status info from sdk-0-0. The last success response was received
> 3h20m2.648304212s ago at 2024-04-23T11:48:35.493682768+00:00. SDK worker
> appears to be permanently unresponsive. Aborting the SDK. For more
> information, see:
> https://cloud.google.com/dataflow/docs/guides/common-errors#worker-lost-contact
> ```
> - And then the pipeline starts to catch up and the watermark progresses again.
> - A job update via Terraform apply also fixes the issue.
> - We do not see any excessive use of worker memory or disk. CPU
> utilization is also close to idle most of the time. I do not think we use
> any C/C++ code with Python, nor any parallelism/threads outside of Beam's
> parallelization.
> Questions:
> 1. What could be the potential causes of such behavior? How can we get more
> insight into this problem?
> 2. I have seen `In Python pipelines, when shutting down inactive bundle
> processors, shutdown logic can overaggressively hold the lock, blocking
> acceptance of new work` listed as a known issue in the Beam release notes.
> What is the status of this? Could it be related?
> We really appreciate any help, clues, or hints on how to debug this issue.
> Best regards
> Wiśniowski Piotr
>
