Re: tensorflow-data-validation (with direct runner) failed to process large data because of gRPC timeout on workers

2020-08-25 Thread Junjian Xu
Hi,

Thank you for your response.

I am not using apache-beam directly but the tensorflow-data-validation
API, so I'm not sure whether there is a deadlock or not.
https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/generate_statistics_from_tfrecord
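
For context, here is roughly how the statistics job is driven (a minimal
sketch; the paths are hypothetical). TFDV builds and runs the Beam pipeline
internally, so I never touch the runner APIs myself:

```python
import tensorflow_data_validation as tfdv

# Hypothetical paths; the real job reads several hundred GB of TFRecords.
stats = tfdv.generate_statistics_from_tfrecord(
    data_location='/data/train/*.tfrecord',
    output_path='/data/train_stats.tfrecord')
```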

What I can tell is that I didn't see any stack trace related to a
deadlock; I've attached the stack (from pystack) of one worker.
Also, the program runs fine on a smaller but still sizable dataset of
around 200GB, which took over 10 minutes to finish.
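
In case it helps, worker stacks can also be dumped from inside the
processes with the standard-library faulthandler module (a sketch,
assuming a POSIX system; this is not part of my actual job):

```python
import faulthandler
import signal

# Make the process dump all thread stacks to stderr when it receives
# SIGUSR1, so a hung worker can be inspected with `kill -USR1 <pid>`.
faulthandler.register(signal.SIGUSR1)

# Or, while debugging a hang, dump unconditionally after a timeout:
# faulthandler.dump_traceback_later(timeout=600, exit=False)
```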

Thanks,
Junjian



On Tue, Aug 25, 2020 at 12:56 AM Luke Cwik  wrote:

> Another person reported something similar for Dataflow, and it seemed as
> though in their scenario they were using locks and either got into a
> deadlock or starved processing for long enough that the watchdog also
> fired. Are you using locks, and/or do you have really long single-element
> processing times?
>
> On Mon, Aug 24, 2020 at 1:50 AM Junjian Xu  wrote:
>
>> Hi,
>>
>> I’m running into a problem with tensorflow-data-validation using the
>> direct runner to generate statistics from some large datasets of over
>> 400GB.
>>
>> It seems that all workers stopped working after the error “Keepalive
>> watchdog fired. Closing transport.” This appears to be a gRPC keepalive
>> timeout.
>>
>> [stack trace snipped; it appears in full in the original message below]

Re: tensorflow-data-validation (with direct runner) failed to process large data because of gRPC timeout on workers

2020-08-24 Thread Luke Cwik
Another person reported something similar for Dataflow, and it seemed as
though in their scenario they were using locks and either got into a
deadlock or starved processing for long enough that the watchdog also
fired. Are you using locks, and/or do you have really long single-element
processing times?
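
For illustration, this is the kind of user-code pattern I mean (a
hypothetical sketch, not something from your pipeline): a shared lock plus
a long per-element step can block the SDK harness long enough that the
gRPC keepalive watchdog trips:

```python
import threading
import time

import apache_beam as beam

_shared_lock = threading.Lock()

class SlowLockedDoFn(beam.DoFn):
    """Hypothetical failure mode: a lock contended across bundles,
    combined with a long single-element computation, can starve the
    worker for minutes, so keepalive pings go unanswered."""

    def process(self, element):
        with _shared_lock:
            time.sleep(600)  # stand-in for a very long per-element step
        yield element
```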

On Mon, Aug 24, 2020 at 1:50 AM Junjian Xu  wrote:

> Hi,
>
> I’m running into a problem with tensorflow-data-validation using the
> direct runner to generate statistics from some large datasets of over
> 400GB.
>
> It seems that all workers stopped working after the error “Keepalive
> watchdog fired. Closing transport.” This appears to be a gRPC keepalive
> timeout.
>
> [stack trace snipped; it appears in full in the original message below]
>
> I originally raised the issue with the tensorflow-data-validation
> community, but we couldn't come up with a solution.
> https://github.com/tensorflow/data-validation/issues/133
>
> The beam version is 2.22.0. Please let me know if I missed anything.

tensorflow-data-validation (with direct runner) failed to process large data because of gRPC timeout on workers

2020-08-24 Thread Junjian Xu
Hi,

I’m running into a problem with tensorflow-data-validation using the
direct runner to generate statistics from some large datasets of over
400GB.

It seems that all workers stopped working after the error “Keepalive
watchdog fired. Closing transport.” This appears to be a gRPC keepalive
timeout.

```
E0804 17:49:07.419950276   44806 chttp2_transport.cc:2881]
ipv6:[::1]:40823: Keepalive watchdog fired. Closing transport.
2020-08-04 17:49:07  local_job_service.py : INFO  Worker: severity: ERROR
timestamp {   seconds: 1596563347   nanos: 420487403 } message: "Python sdk
harness failed: \nTraceback (most recent call last):\n  File
\"/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker_main.py\",
line 158, in main\n
 sdk_pipeline_options.view_as(ProfilingOptions))).run()\n  File
\"/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py\",
line 213, in run\nfor work_request in
self._control_stub.Control(get_responses()):\n  File
\"/home/ec2-user/lib64/python3.7/site-packages/grpc/_channel.py\", line
416, in __next__\nreturn self._next()\n  File
\"/home/ec2-user/lib64/python3.7/site-packages/grpc/_channel.py\", line
706, in _next\nraise self\ngrpc._channel._MultiThreadedRendezvous:
<_MultiThreadedRendezvous of RPC that terminated with:\n\tstatus =
StatusCode.UNAVAILABLE\n\tdetails = \"keepalive watchdog
timeout\"\n\tdebug_error_string =
\"{\"created\":\"@1596563347.420024732\",\"description\":\"Error received
from peer
ipv6:[::1]:40823\",\"file\":\"src/core/lib/surface/call.cc\",\"file_line\":1055,\"grpc_message\":\"keepalive
watchdog timeout\",\"grpc_status\":14}\"\n>" trace: "Traceback (most recent
call last):\n  File
\"/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker_main.py\",
line 158, in main\n
 sdk_pipeline_options.view_as(ProfilingOptions))).run()\n  File
\"/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py\",
line 213, in run\nfor work_request in
self._control_stub.Control(get_responses()):\n  File
\"/home/ec2-user/lib64/python3.7/site-packages/grpc/_channel.py\", line
416, in __next__\nreturn self._next()\n  File
\"/home/ec2-user/lib64/python3.7/site-packages/grpc/_channel.py\", line
706, in _next\nraise self\ngrpc._channel._MultiThreadedRendezvous:
<_MultiThreadedRendezvous of RPC that terminated with:\n\tstatus =
StatusCode.UNAVAILABLE\n\tdetails = \"keepalive watchdog
timeout\"\n\tdebug_error_string =
\"{\"created\":\"@1596563347.420024732\",\"description\":\"Error received
from peer
ipv6:[::1]:40823\",\"file\":\"src/core/lib/surface/call.cc\",\"file_line\":1055,\"grpc_message\":\"keepalive
watchdog timeout\",\"grpc_status\":14}\"\n>\n" log_location:
"/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker_main.py:161"
thread: "MainThread"
Traceback (most recent call last):
  File "/usr/lib64/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
  File "/usr/lib64/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
  File
"/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker_main.py",
line 248, in <module>
main(sys.argv)
  File
"/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker_main.py",
line 158, in main
sdk_pipeline_options.view_as(ProfilingOptions))).run()
  File
"/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py",
line 213, in run
for work_request in self._control_stub.Control(get_responses()):
  File "/home/ec2-user/lib64/python3.7/site-packages/grpc/_channel.py",
line 416, in __next__
return self._next()
  File "/home/ec2-user/lib64/python3.7/site-packages/grpc/_channel.py",
line 706, in _next
raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC
that terminated with:
status = StatusCode.UNAVAILABLE
details = "keepalive watchdog timeout"
debug_error_string =
"{"created":"@1596563347.420024732","description":"Error received from peer
ipv6:[::1]:40823","file":"src/core/lib/surface/call.cc","file_line":1055,"grpc_message":"keepalive
watchdog timeout","grpc_status":14}"
```
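
For reference, the watchdog in the log corresponds to gRPC's client-side
keepalive settings. A minimal sketch of the mechanism (the address and
numbers are made up, and Beam configures its own channels internally; this
only shows which knobs are involved):

```python
import grpc

# The client sends an HTTP/2 ping every keepalive_time_ms; if no ack
# arrives within keepalive_timeout_ms, the keepalive watchdog fires and
# the transport is closed with StatusCode.UNAVAILABLE, as in the log above.
channel = grpc.insecure_channel(
    'localhost:40823',
    options=[
        ('grpc.keepalive_time_ms', 20000),
        ('grpc.keepalive_timeout_ms', 10000),
        ('grpc.keepalive_permit_without_calls', 1),
    ],
)
```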

I originally raised the issue with the tensorflow-data-validation
community, but we couldn't come up with a solution.
https://github.com/tensorflow/data-validation/issues/133

The beam version is 2.22.0. Please let me know if I missed anything.

Thanks,
Junjian