Hi,

Getting back with an update: after upgrading `grpcio` the issues are gone. Thank you for the solution and the investigation. Feels like I owe you a beer :)
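For anyone else hitting this later, it is worth double-checking that the upgraded version actually ends up inside the custom SDK container image. A minimal sanity check (assuming the image is built the usual way; this just prints the installed version) looks like:

```python
# Minimal sanity check: run inside the custom SDK container to confirm
# the grpcio upgrade is actually what the workers will use.
import grpc

print(grpc.__version__)
```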

Best

Wiśniowski Piotr

On 24.04.2024 22:11, Valentyn Tymofieiev wrote:


On Wed, Apr 24, 2024 at 12:40 PM Wiśniowski Piotr <contact.wisniowskipi...@gmail.com> wrote:

    Hi!

    Thank you for the hint. We will try the mitigation from the
    issue. We had already tried everything from
    https://cloud.google.com/dataflow/docs/guides/common-errors#worker-lost-contact,
    but let's hope upgrading the dependency will help. I will reply to
    this thread once I get confirmation.

    BTW, great job on the investigation of the bug you mentioned.
    Impressive. Seems like a nasty one.

Thanks. I was specifically recommending that you check the recently added content under "It might be possible to retrieve stacktraces of a thread that is holding the GIL on a running Dataflow worker as follows:", as that should help find out what is causing the stuckness in your case. But hopefully it won't be necessary after you adjust the grpcio version.
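In case it helps, one generic way to capture such stacktraces (separate from the exact steps in the linked guide) is to register Python's `faulthandler` against a signal in the pipeline code, so the stuck SDK harness process can be made to dump every thread's stack on demand; `enable_stack_dumps` and the choice of signal below are only illustrative:

```python
# Generic sketch, not the exact steps from the linked Dataflow guide:
# register faulthandler at startup so that sending a signal to the stuck
# Python SDK harness process dumps all thread stack traces to stderr,
# where they show up in the worker logs.
import faulthandler
import signal

DUMP_SIGNAL = signal.SIGUSR1  # illustrative choice of signal


def enable_stack_dumps():
    """Call once at module import time in the pipeline code."""
    faulthandler.register(DUMP_SIGNAL, all_threads=True)


# Later, from a shell on the worker VM/container:
#   kill -USR1 <pid-of-the-sdk-harness-process>
# Tools such as py-spy ("py-spy dump --pid <pid>") can capture similar
# per-thread stacks without any code changes.
```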

    Best,

    Wiśniowski Piotr

    On 24.04.2024 00:31, Valentyn Tymofieiev via user wrote:
    You might be running into
    https://github.com/apache/beam/issues/30867.

    Among the error messages you mentioned, the following is closest
    to the root cause: ```Error message from worker: generic::internal:
    Error encountered with the status channel: There are 10
    consecutive failures obtaining SDK worker status info from
    sdk-0-0. The last success response was received 3h20m2.648304212s
    ago at 2024-04-23T11:48:35.493682768+00:00. SDK worker appears to
    be permanently unresponsive. Aborting the SDK. For more
    information, see:
    https://cloud.google.com/dataflow/docs/guides/common-errors#worker-lost-contact```

    If the mitigations in https://github.com/apache/beam/issues/30867
    don't resolve your issue, please see
    https://cloud.google.com/dataflow/docs/guides/common-errors#worker-lost-contact
    for instructions on how to find what causes the workers to be stuck.

    Thanks!

    On Tue, Apr 23, 2024 at 12:17 PM Wiśniowski Piotr
    <contact.wisniowskipi...@gmail.com> wrote:

        Hi,
        We are investigating an issue with our Python SDK streaming
        pipelines and have a few questions, but first some context.
        Our stack:
        - Python SDK 2.54.0, but we also tried 2.55.1
        - Dataflow Streaming Engine with the SDK in a custom container
        image (we also tried Prime)
        - Currently our pipelines have low enough traffic that a
        single node handles it most of the time, but occasionally we
        do scale up.
        - Deployment via the Terraform
        `google_dataflow_flex_template_job` resource, which normally
        performs a job update when re-applying Terraform.
        - We use `ReadModifyWriteStateSpec` a lot, along with other
        states and watermark timers, but we keep the size of the state
        under control (see the sketch right after this list).
        - We use custom coders based on Pydantic Avro.
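        As a rough illustration of the pattern above (class, state, and
        timer names and the 60-second flush delay are made up, not our
        actual code), a stateful DoFn combining `ReadModifyWriteStateSpec`
        with a watermark timer looks roughly like this:

        ```python
        # Rough sketch of the stateful pattern described above; names
        # are illustrative, not production code.
        import apache_beam as beam
        from apache_beam.coders import VarIntCoder
        from apache_beam.transforms.timeutil import TimeDomain
        from apache_beam.transforms.userstate import (
            ReadModifyWriteStateSpec,
            TimerSpec,
            on_timer,
        )


        class TrackLatestSeen(beam.DoFn):
            # Keeps only the most recent value per key.
            LATEST = ReadModifyWriteStateSpec('latest', VarIntCoder())
            # Event-time (watermark) timer used to flush the stored value.
            FLUSH = TimerSpec('flush', TimeDomain.WATERMARK)

            def process(
                self,
                element,  # expects a (key, int_value) pair
                latest=beam.DoFn.StateParam(LATEST),
                flush=beam.DoFn.TimerParam(FLUSH),
                timestamp=beam.DoFn.TimestampParam,
            ):
                _key, value = element
                latest.write(value)
                # Fire once the watermark passes element timestamp + 60s.
                flush.set(timestamp + 60)

            @on_timer(FLUSH)
            def on_flush(self, latest=beam.DoFn.StateParam(LATEST)):
                value = latest.read()
                latest.clear()
                if value is not None:
                    yield value
        ```

        The input PCollection has to be keyed, since state and timers
        are scoped per key and window.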
        The issue:
        - Occasionally watermark progression stops. The issue is not
        deterministic and happens roughly 1-2 times per day across a
        few pipelines.
        - No user code errors are reported, but we do get errors like this:
        ```INTERNAL: The work item requesting state read is no longer
        valid on the backend. The work has already completed or will
        be retried. This is expected during autoscaling events.
        [type.googleapis.com/util.MessageSetPayload='[dist_proc.dax.internal.TrailProto]
        { trail_point { source_file_loc { filepath:
        "dist_proc/windmill/client/streaming_rpc_client.cc" line: 767
        } } }']```
        ```ABORTED: SDK harness sdk-0-0 disconnected. This usually
        means that the process running the pipeline code has crashed.
        Inspect the Worker Logs and the Diagnostics tab to determine
        the cause of the crash.
        [type.googleapis.com/util.MessageSetPayload='[dist_proc.dax.internal.TrailProto]
        { trail_point { source_file_loc { filepath:
        "dist_proc/dax/workflow/worker/fnapi_control_service.cc"
        line: 217 } } } [dist_proc.dax.MessageCode] { origin_id:
        5391582787251181999
        [dist_proc.dax.workflow.workflow_io_message_ext]:
        SDK_DISCONNECT }']```
        ```Work item for sharding key 8dd4578b4f280f5d tokens
        (1316764909133315359, 17766288489530478880) encountered error
        during processing, will be retried (possibly on another
        worker): generic::internal: Error encountered with the status
        channel: SDK harness sdk-0-0 disconnected. with MessageCode:
        (93f1db2f7a4a325c): SDK disconnect.```
        ```Python (worker sdk-0-0_sibling_1) exited 1 times: signal:
        segmentation fault (core dumped) restarting SDK process```
        - We did manage to correlate this with either a vertical
        autoscaling event (when using Prime) or other worker
        replacements done by Dataflow under the hood, but this is not
        deterministic.
        - For a few hours watermark progress stops, while other
        workers continue to process messages.
        - And after a few hours:
        ```Error message from worker: generic::internal: Error
        encountered with the status channel: There are 10 consecutive
        failures obtaining SDK worker status info from sdk-0-0. The
        last success response was received 3h20m2.648304212s ago at
        2024-04-23T11:48:35.493682768+00:00. SDK worker appears to be
        permanently unresponsive. Aborting the SDK. For more
        information, see:
        https://cloud.google.com/dataflow/docs/guides/common-errors#worker-lost-contact```
        - Then the pipeline starts to catch up and the watermark
        progresses again.
        - A job update via Terraform apply also fixes the issue.
        - We do not see any excessive use of worker memory or disk.
        CPU utilization is also close to idle most of the time. I do
        not think we use any C/C++ code with Python, nor any
        parallelism/threads outside of Beam's parallelization.
        Questions:
        1. What could be potential causes of such behavior? How can we
        get more insight into this problem?
        2. I have seen `In Python pipelines, when shutting down
        inactive bundle processors, shutdown logic can
        overaggressively hold the lock, blocking acceptance of new
        work` listed as a known issue in the Beam release notes. What
        is the status of this? Could it potentially be related?
        I would really appreciate any help, clues, or hints on how to
        debug this issue.
        Best regards
        Wiśniowski Piotr
