I think there is currently no way to ensure that tasks are spread out across
different machines, because the scheduling logic does not take into account
which machine a slot is on. I currently see two workarounds:
- Give all operators the same parallelism and only have 8 slots in your
cluster in total
- Give your sinks parallelism 224 (same as the windows). I think multiple
sinks writing to the same Kafka partition should not be a problem, unless
that's a problem in your setup, of course.
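
To make the second workaround a bit more concrete, here is a rough counting
sketch in plain Python (it is not Flink code and does not call any Flink API).
It assumes that a slot holds at most one subtask of a given operator, which is
how I understand slot sharing to work, and it models "the scheduler ignores
machines" as simply taking slots in whatever order they are offered. The
function names are made up for illustration, and the modulo mapping from sink
subtask to Kafka partition is an assumption about a fixed-style partitioner,
not something confirmed in this thread.

```python
import random
from collections import Counter

NUM_TASK_MANAGERS = 7   # from the thread: 7 TMs
SLOTS_PER_TM = 32       # from the thread: 32 slots per TM
NUM_PARTITIONS = 7      # from the thread: 7 partitions in the output topic


def all_slots():
    """Every slot in the cluster as a (tm_index, slot_index) pair, grouped by TM."""
    return [(tm, slot) for tm in range(NUM_TASK_MANAGERS)
            for slot in range(SLOTS_PER_TM)]


def sinks_per_tm(sink_parallelism, slot_order):
    """Place one sink subtask per slot, taking slots in the given order
    without looking at which machine they are on, and count sinks per TM."""
    if sink_parallelism > len(slot_order):
        raise ValueError("not enough slots for this parallelism")
    used = slot_order[:sink_parallelism]
    counts = Counter(tm for tm, _ in used)
    return [counts[tm] for tm in range(NUM_TASK_MANAGERS)]


def writers_per_partition(sink_parallelism):
    """Assumed mapping: sink subtask i writes to partition i % NUM_PARTITIONS."""
    counts = Counter(i % NUM_PARTITIONS for i in range(sink_parallelism))
    return [counts[p] for p in range(NUM_PARTITIONS)]


if __name__ == "__main__":
    slots = all_slots()

    # Parallelism 7: if the first slots offered all sit on one TM,
    # every sink ends up on that machine.
    print(sinks_per_tm(7, slots))         # [7, 0, 0, 0, 0, 0, 0]

    # Parallelism 224: every slot gets exactly one sink, so the slot
    # order no longer matters and each TM hosts 32 sinks.
    shuffled = slots[:]
    random.Random(0).shuffle(shuffled)
    print(sinks_per_tm(224, shuffled))    # [32, 32, 32, 32, 32, 32, 32]

    # Under the assumed modulo mapping, each partition has 32 writers.
    print(writers_per_partition(224))     # [32, 32, 32, 32, 32, 32, 32]
```

The point is only that once the sink parallelism equals the total number of
slots, the placement is forced to be even, whatever order the scheduler picks
slots in.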
What do you think?
> On 2. Mar 2018, at 06:55, Dongwon Kim <eastcirc...@gmail.com> wrote:
> We have a standalone cluster where 1 JM and 7 TMs are running on 8 servers.
> We have 224 cores in total (each TM has 32 slots) and we want to use all
> slots for a single streaming job.
> The single job roughly consists of the following three types of tasks:
> - Kafka source tasks (Parallelism : 7 as the number of partitions in the
> input topic is 7)
> - Session window tasks (Parallelism : 224)
> - Kafka sink tasks (Parallelism : 7 as the number of partitions in the
> output topic is 7)
> We want 7 sources and 7 sinks to be evenly scheduled over different nodes.
> Source tasks are scheduled as wanted (see "1 source.png").
> <1 source.png>
> However, sink tasks are scheduled on a single node (see "2 sink.png").
> <2 sink.png>
> As we use the whole standalone cluster for a single job, this scheduling
> behavior causes the output of all 224 session window tasks to be sent to
> a single physical machine.
> Is it because locality is only considered for the Kafka source?
> I also checked that different partitions are handled by different brokers
> for both the input topic and the output topic in Kafka.
> Am I missing something needed to spread Kafka sink tasks over different nodes?
> - Dongwon