Have you had a chance to look at it yet? If you need more information,
just let me know.
On 12.10.2016 21:12, Jürgen Thomann wrote:
Thanks for your suggestions. We are using the DataStream API, and I
tried disabling operator chaining completely, but that didn't help.
I attached the plan. To add some context: it starts with a Kafka
source followed by a map operation (parallelism 4). The next map is
the expensive part, with a parallelism of 18; it produces a Tuple2
that is used for splitting. From here on the parallelism is always 2,
except for the sink, which has 1. Both resulting streams have two
maps, a filter, and one more map, and end with an
assignTimestampsAndWatermarks. Where there is a small box in the
picture, it is a filter operation; otherwise the stream goes directly
to a keyBy, timeWindow and apply operation followed by a sink.
If one task manager contains more subtasks of the expensive map than
any other task manager, everything later in the stream runs on that
same task manager. If two task managers have the same number of
subtasks, the following tasks with a parallelism of 2 are distributed
over those two task managers.
It is also interesting that the task managers have 6 task slots
configured, and even when the expensive part has 6 subtasks on one
task manager, everything later in the flow still runs on this task
manager. This also happens if operator chaining is disabled.
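
As background for the slot numbers above, here is a rough sketch of the slot arithmetic. It assumes Flink's default setup, where all operators of a job are in one slot sharing group, so a single slot can hold one subtask of each operator and the job needs as many slots as its highest parallelism. That would also be consistent with a task manager still accepting downstream subtasks while it hosts 6 subtasks of the expensive map. The class and method names are made up for illustration, and the source parallelism is left out because the message does not state it.

```java
import java.util.Arrays;

/**
 * Back-of-the-envelope slot arithmetic for the job described above.
 * Assumes Flink's default behaviour, where all operators share one
 * slot sharing group. Class and method names are illustrative only.
 */
final class SlotEstimate {

    /** With a single slot sharing group, a job needs as many slots as its highest operator parallelism. */
    static int requiredSlots(int[] operatorParallelisms) {
        return Arrays.stream(operatorParallelisms).max().orElse(0);
    }

    /** Smallest number of task managers that can offer the required slots. */
    static int requiredTaskManagers(int requiredSlots, int slotsPerTaskManager) {
        // Integer ceiling division.
        return (requiredSlots + slotsPerTaskManager - 1) / slotsPerTaskManager;
    }

    public static void main(String[] args) {
        // Parallelisms taken from the message: first map 4, expensive map 18,
        // downstream operators 2, sink 1.
        int[] parallelisms = {4, 18, 2, 1};
        int slots = requiredSlots(parallelisms);
        // 6 slots per task manager, as configured in the message.
        int taskManagers = requiredTaskManagers(slots, 6);
        System.out.println("slots needed: " + slots);
        System.out.println("task managers needed (6 slots each): " + taskManagers);
    }
}
```

This only counts slots. It does not model how the scheduler chooses which task manager receives the downstream subtasks, which is the actual question in this thread.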
On 12.10.2016 17:43, Robert Metzger wrote:
Are you using the DataStream or the DataSet API?
Maybe the operator chaining is causing too many operations to be
"packed" into one task. Check out this documentation page:
You could try to disable chaining completely to see if that resolves
the issue (you'll probably pay for this by having more serialization
overhead and network traffic).
If my suggestions don't help, can you post a screenshot of your job
plan (from the web interface) here, so that we can see what operations
you are performing?