Relevant versions: Beam 2.1, Flink 1.3.
From: Reinier Kip <r...@bol.com>
Sent: 12 March 2018 13:45:47
Subject: HDFS data locality and distribution
I'm trying to batch-process 30-ish files from HDFS, but I see that data is
distributed very badly across slots. 4 out of 32 slots get 4/5ths of the data,
another 3 slots get about 1/5th and a last slot just a few records. This
probably triggers disk spillover on these slots and slows down the job
immensely. The data has many many unique keys and processing could be done in a
highly parallel manner. From what I understand, HDFS data locality governs
which splits are assigned to which subtask.
* I'm running a Beam on Flink on YARN pipeline.
* I'm reading 30-ish files, whose records are later grouped by their
millions of unique keys.
* For now, I have 8 task managers by 4 slots. Beam sets all subtasks to
have 32 parallelism.
* Data seems to be localised to 9 out of the 32 slots, 3 out of the 8 task
Does the statement of input split assignment ring true? Is the fact that data
isn't redistributed an effort from Flink to have high data locality, even if
this means disk spillover for a few slots/tms and idleness for others? Is there
any use for parallelism if work isn't distributed anyway?
Thanks for your time, Reinier