You might be running into https://github.com/apache/beam/issues/30867.
Among the error messages you mentioned, the following is closest to the root cause: ```Error message from worker: generic::internal: Error encountered with the status channel: There are 10 consecutive failures obtaining SDK worker status info from sdk-0-0. The last success response was received 3h20m2.648304212s ago at 2024-04-23T11:48:35.493682768+00:00. SDK worker appears to be permanently unresponsive. Aborting the SDK. For more information, see: https://cloud.google.com/dataflow/docs/guides/common-errors#worker-lost-contact```

It indicates the runner gave up on an SDK harness that stopped responding, which matches the multi-hour watermark stall you describe; the other errors look like downstream symptoms of that harness being restarted.

If the mitigations in https://github.com/apache/beam/issues/30867 don't resolve your issue, please see https://cloud.google.com/dataflow/docs/guides/common-errors#worker-lost-contact for instructions on how to find out what is causing the workers to get stuck.
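Independent of those docs, one generic way to see where a stuck or segfaulting Python worker is spending its time is the standard-library faulthandler module. Below is a minimal sketch of that idea, not an official Beam or Dataflow facility: DebuggableDoFn is a made-up name, setup() is just one convenient place to install the handlers, and I'm assuming stderr from the SDK harness ends up in the worker logs.

```
# Illustrative sketch only: install faulthandler in worker code so that
# (a) a segfault in native code dumps Python tracebacks before the
# process dies, and (b) thread stacks are dumped periodically, showing
# where a wedged worker is stuck.
import faulthandler
import sys

import apache_beam as beam


class DebuggableDoFn(beam.DoFn):  # hypothetical name
    def setup(self):
        # On fatal signals (SIGSEGV, SIGABRT, SIGBUS, ...), print the
        # traceback of every Python thread to stderr, which should be
        # surfaced in the Dataflow worker logs (assumption).
        faulthandler.enable(file=sys.stderr, all_threads=True)
        # Also dump all thread stacks every 10 minutes, repeating until
        # cancelled, so a permanently stuck worker keeps leaving evidence.
        faulthandler.dump_traceback_later(timeout=600, repeat=True, file=sys.stderr)

    def process(self, element):
        yield element
```

If you can SSH to the worker VM, a sampling tool like py-spy (e.g. `py-spy dump --pid <pid>`) gives a similar point-in-time view without changing pipeline code.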
Thanks!

On Tue, Apr 23, 2024 at 12:17 PM Wiśniowski Piotr <contact.wisniowskipi...@gmail.com> wrote:

> Hi,
>
> We are investigating an issue with our Python SDK streaming pipelines and have a few questions, but first some context.
>
> Our stack:
> - Python SDK 2.54.0, but we also tried 2.55.1.
> - Dataflow Streaming Engine with the SDK in a container image (we also tried Prime).
> - Currently our pipelines have low enough traffic that a single node handles it most of the time, but occasionally we do scale up.
> - Deployment by the Terraform `google_dataflow_flex_template_job` resource, which normally does a job update when re-applying Terraform.
> - We make heavy use of `ReadModifyWriteStateSpec`, other states and watermark timers, but we keep the size of the state under control.
> - We use custom coders (Pydantic Avro).
>
> The issue:
> - Occasionally watermark progression stops. The issue is not deterministic and happens 1-2 times per day across a few pipelines.
> - No user-code errors are reported, but we do get errors like these:
> ```INTERNAL: The work item requesting state read is no longer valid on the backend. The work has already completed or will be retried. This is expected during autoscaling events. [type.googleapis.com/util.MessageSetPayload='[dist_proc.dax.internal.TrailProto] { trail_point { source_file_loc { filepath: "dist_proc/windmill/client/streaming_rpc_client.cc" line: 767 } } }']```
> ```ABORTED: SDK harness sdk-0-0 disconnected. This usually means that the process running the pipeline code has crashed. Inspect the Worker Logs and the Diagnostics tab to determine the cause of the crash. [type.googleapis.com/util.MessageSetPayload='[dist_proc.dax.internal.TrailProto] { trail_point { source_file_loc { filepath: "dist_proc/dax/workflow/worker/fnapi_control_service.cc" line: 217 } } } [dist_proc.dax.MessageCode] { origin_id: 5391582787251181999 [dist_proc.dax.workflow.workflow_io_message_ext]: SDK_DISCONNECT }']```
> ```Work item for sharding key 8dd4578b4f280f5d tokens (1316764909133315359, 17766288489530478880) encountered error during processing, will be retried (possibly on another worker): generic::internal: Error encountered with the status channel: SDK harness sdk-0-0 disconnected. with MessageCode: (93f1db2f7a4a325c): SDK disconnect.```
> ```Python (worker sdk-0-0_sibling_1) exited 1 times: signal: segmentation fault (core dumped) restarting SDK process```
> - We managed to correlate this with either a vertical autoscaling event (when using Prime) or other worker replacements done by Dataflow under the hood, but this is not deterministic.
> - For a few hours watermark progress stops, although other workers do process messages.
> - And after a few hours:
> ```Error message from worker: generic::internal: Error encountered with the status channel: There are 10 consecutive failures obtaining SDK worker status info from sdk-0-0. The last success response was received 3h20m2.648304212s ago at 2024-04-23T11:48:35.493682768+00:00. SDK worker appears to be permanently unresponsive. Aborting the SDK. For more information, see: https://cloud.google.com/dataflow/docs/guides/common-errors#worker-lost-contact```
> - Then the pipeline starts to catch up and the watermark progresses again.
> - A job update by Terraform apply also fixes the issue.
> - We do not see any excessive use of worker memory or disk, and CPU utilization is close to idle most of the time. I do not think we use C/C++ code with Python, nor parallelism/threads outside Beam's parallelization.
>
> Questions:
> 1. What could be potential causes of such behavior? How can we get more insight into this problem?
> 2. I have seen "In Python pipelines, when shutting down inactive bundle processors, shutdown logic can overaggressively hold the lock, blocking acceptance of new work" listed as a known issue in the Beam release notes. What is the status of this? Could it be related?
>
> We would really appreciate any help, clues or hints on how to debug this issue.
>
> Best regards,
> Wiśniowski Piotr