Google Dataflow not distributing load across workers

Carlos Alonso Wed, 24 Jan 2018 04:17:32 -0800

Hello everyone!!

I'm experiencing a weird issue I'd like to understand. I have a workload
that basically reads data from PubSub and stores it, organised by types and
windows, in GCS.


When I run it on low load it works fine; it actually only needs one worker.
But if I increase the load by publishing more data to PubSub, the system is
stressed a little more and Dataflow autoscales (THROUGHPUT_BASED
algorithm).

All OK up to this point. The weird issue is that, on the graphs, those
newly added workers are idle!! (3% CPU usage vs 90% on the original one, and
something similar in terms of memory usage). The pipeline starts struggling
and the system lag starts growing. After a while OOMs start happening, and
then the whole pipeline seems to get stuck and I have to cancel it.

Looking at the logs after the autoscaling point, a few lines appear
continuously. I'd like to understand what they mean and whether they are
the reason why the new workers are not receiving work.

The log entries below appear continuously and interleaved, which suggests
to me that they are all part of the same issue repeating over and over again.

* Setting node annotation to enable volume controller attach/detach
* GetWork timed out, retrying
* 860.865: [Full GC (Ergonomics) [PSYoungGen: 633856K->143127K(1267200K)]
[ParOldGen: 3800948K->3801001K(3801088K)] 4434804K->3944129K(5068288K),
[Metaspace: 133169K->132269K(1204224K)], 3.0900678 secs] [Times: user=9.14
sys=0.06, real=3.09 secs]
* [GC (Allocation Failure) [PSYoungGen: 243712K->9712K(253440K)]
250764K->31183K(338944K), 0.0321758 secs] [Times: user=0.06 sys=0.03,
real=0.03 secs]
* myjobname-01240255-93ce-harness-rts4 Got error NOT_FOUND: No setup task
returned in work item. while requesting setup task
* myjobname-01240255-93ce-harness-0jwl Check failed:
stats.required_used_bytes < 1LL << 40 S5 Realtime Timers
* Error syncing pod 689facdfb2b53aa2cff2f09d1c63fc5a
("dataflow-myjobname-01240255-93ce-harness-0jwl_default(689facdfb2b53aa2cff2f09d1c63fc5a)"),
skipping: failed to "StartContainer" for "windmill" with CrashLoopBackOff:
"Back-off 40s restarting failed container=windmill
pod=dataflow-myjobname-01240255-93ce-harness-0jwl_default(689facdfb2b53aa2cff2f09d1c63fc5a)"
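
As a reading aid for the Full GC entry above, here is a minimal Python sketch
that pulls the per-region heap figures out of that line. It is not part of the
pipeline, and the names REGION and region_usage are made up for illustration;
it only restates the numbers already in the log and is not a diagnosis.

```python
import re

# Full GC line copied from the log excerpt above (joined onto one line).
FULL_GC_LINE = (
    "860.865: [Full GC (Ergonomics) [PSYoungGen: 633856K->143127K(1267200K)] "
    "[ParOldGen: 3800948K->3801001K(3801088K)] 4434804K->3944129K(5068288K), "
    "[Metaspace: 133169K->132269K(1204224K)], 3.0900678 secs]"
)

# Matches "<name>: <before>K-><after>K(<capacity>K)" for each named region.
# The unnamed whole-heap total in the middle of the line is skipped.
REGION = re.compile(r"(\w+): (\d+)K->(\d+)K\((\d+)K\)")


def region_usage(line):
    """Return {region: (before_kb, after_kb, capacity_kb)} for a GC log line."""
    return {
        name: (int(before), int(after), int(cap))
        for name, before, after, cap in REGION.findall(line)
    }


usage = region_usage(FULL_GC_LINE)
for name, (before, after, cap) in usage.items():
    # Occupancy after the collection, and how much the collection freed.
    print(f"{name}: {after / cap:.1%} full after GC, reclaimed {before - after} KB")
```

Read this way, the old generation holds 3801001K of its 3801088K capacity
after a 3-second full collection that freed nothing from it, which matches the
OOMs I see later on that worker.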

Thanks for your help!!
