Another user reported something similar on Dataflow. In their case the
pipeline used locks and either deadlocked or starved processing for long
enough that the keepalive watchdog also timed out. Are you using locks, or
do any single elements take a very long time to process?
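
If it helps to check the second possibility, here is a minimal sketch in
plain Python. It is not a Beam or TFDV API: `find_slow_elements`, its
parameters, and the 30-second default threshold are made up for
illustration. It runs a processing function over a sample of elements and
reports the ones whose processing time reaches the threshold:

```python
import time


def find_slow_elements(process, elements, threshold_s=30.0, clock=time.monotonic):
    """Return (index, elapsed_seconds) for elements that are slow to process.

    `process` is whatever per-element function you want to examine.
    `threshold_s` is an arbitrary cutoff chosen for illustration; it is not
    the actual keepalive timeout. `clock` is injectable so the helper can be
    tested without real waiting.
    """
    slow = []
    for index, element in enumerate(elements):
        start = clock()
        process(element)
        elapsed = clock() - start
        if elapsed >= threshold_s:
            slow.append((index, elapsed))
    return slow
```

Running this over a small sample of the input with your per-element logic
should show whether a few elements dominate the processing time. It will not
detect a deadlock, since a call that never returns never gets reported.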

On Mon, Aug 24, 2020 at 1:50 AM Junjian Xu <j...@indeed.com> wrote:

> Hi,
>
> I’m running into a problem using tensorflow-data-validation with the
> direct runner to generate statistics from some large datasets (over 400GB).
>
> All workers seem to stop working after the error message
> “Keepalive watchdog fired. Closing transport.” This looks like a gRPC
> keepalive timeout.
>
> ```
> E0804 17:49:07.419950276   44806 chttp2_transport.cc:2881]
> ipv6:[::1]:40823: Keepalive watchdog fired. Closing transport.
> 2020-08-04 17:49:07  local_job_service.py : INFO  Worker: severity: ERROR
> timestamp {   seconds: 1596563347   nanos: 420487403 } message: "Python sdk
> harness failed: \nTraceback (most recent call last):\n  File
> \"/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker_main.py\",
> line 158, in main\n
>  sdk_pipeline_options.view_as(ProfilingOptions))).run()\n  File
> \"/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py\",
> line 213, in run\n    for work_request in
> self._control_stub.Control(get_responses()):\n  File
> \"/home/ec2-user/lib64/python3.7/site-packages/grpc/_channel.py\", line
> 416, in __next__\n    return self._next()\n  File
> \"/home/ec2-user/lib64/python3.7/site-packages/grpc/_channel.py\", line
> 706, in _next\n    raise self\ngrpc._channel._MultiThreadedRendezvous:
> <_MultiThreadedRendezvous of RPC that terminated with:\n\tstatus =
> StatusCode.UNAVAILABLE\n\tdetails = \"keepalive watchdog
> timeout\"\n\tdebug_error_string =
> \"{\"created\":\"@1596563347.420024732\",\"description\":\"Error received
> from peer
> ipv6:[::1]:40823\",\"file\":\"src/core/lib/surface/call.cc\",\"file_line\":1055,\"grpc_message\":\"keepalive
> watchdog timeout\",\"grpc_status\":14}\"\n>" trace: "Traceback (most recent
> call last):\n  File
> \"/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker_main.py\",
> line 158, in main\n
>  sdk_pipeline_options.view_as(ProfilingOptions))).run()\n  File
> \"/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py\",
> line 213, in run\n    for work_request in
> self._control_stub.Control(get_responses()):\n  File
> \"/home/ec2-user/lib64/python3.7/site-packages/grpc/_channel.py\", line
> 416, in __next__\n    return self._next()\n  File
> \"/home/ec2-user/lib64/python3.7/site-packages/grpc/_channel.py\", line
> 706, in _next\n    raise self\ngrpc._channel._MultiThreadedRendezvous:
> <_MultiThreadedRendezvous of RPC that terminated with:\n\tstatus =
> StatusCode.UNAVAILABLE\n\tdetails = \"keepalive watchdog
> timeout\"\n\tdebug_error_string =
> \"{\"created\":\"@1596563347.420024732\",\"description\":\"Error received
> from peer
> ipv6:[::1]:40823\",\"file\":\"src/core/lib/surface/call.cc\",\"file_line\":1055,\"grpc_message\":\"keepalive
> watchdog timeout\",\"grpc_status\":14}\"\n>\n" log_location:
> "/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker_main.py:161"
> thread: "MainThread"
> Traceback (most recent call last):
>   File "/usr/lib64/python3.7/runpy.py", line 193, in _run_module_as_main
>     "__main__", mod_spec)
>   File "/usr/lib64/python3.7/runpy.py", line 85, in _run_code
>     exec(code, run_globals)
>   File
> "/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker_main.py",
> line 248, in <module>
>     main(sys.argv)
>   File
> "/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker_main.py",
> line 158, in main
>     sdk_pipeline_options.view_as(ProfilingOptions))).run()
>   File
> "/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py",
> line 213, in run
>     for work_request in self._control_stub.Control(get_responses()):
>   File "/home/ec2-user/lib64/python3.7/site-packages/grpc/_channel.py",
> line 416, in __next__
>     return self._next()
>   File "/home/ec2-user/lib64/python3.7/site-packages/grpc/_channel.py",
> line 706, in _next
>     raise self
> grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC
> that terminated with:
>         status = StatusCode.UNAVAILABLE
>         details = "keepalive watchdog timeout"
>         debug_error_string =
> "{"created":"@1596563347.420024732","description":"Error received from peer
> ipv6:[::1]:40823","file":"src/core/lib/surface/call.cc","file_line":1055,"grpc_message":"keepalive
> watchdog timeout","grpc_status":14}"
> ```
>
> I originally raised the issue in the tensorflow-data-validation community,
> but we couldn't come up with a solution.
> https://github.com/tensorflow/data-validation/issues/133
>
> The beam version is 2.22.0. Please let me know if I missed anything.
>
> Thanks,
> Junjian
>
