Re: HDFS data locality and distribution

2018-03-19 Thread Reinier Kip
From: Chesnay Schepler <ches...@apache.org> Sent: 13 March 2018 12:40:02 To: user@flink.apache.org Subject: Re: HDFS data locality and distribution Hello, You said that "data is distributed very badly across slots"; do you mean that only a small number of subtasks is reading fro

Re: HDFS data locality and distribution

2018-03-13 Thread Chesnay Schepler
9, Reinier Kip wrote: Relevant versions: Beam 2.1, Flink 1.3. From: Reinier Kip <r...@bol.com> Sent: 12 March 2018 13:45:47 To: user@flink.apache.org Subject: HDFS data locality and distribution Hey all, I'm

Re: HDFS data locality and distribution

2018-03-12 Thread Reinier Kip
Relevant versions: Beam 2.1, Flink 1.3. From: Reinier Kip <r...@bol.com> Sent: 12 March 2018 13:45:47 To: user@flink.apache.org Subject: HDFS data locality and distribution Hey all, I'm trying to batch-process 30-ish files from HDFS, but I see tha

HDFS data locality and distribution

2018-03-12 Thread Reinier Kip
Hey all, I'm trying to batch-process 30-ish files from HDFS, but I see that data is distributed very badly across slots. 4 out of 32 slots get 4/5ths of the data, another 3 slots get about 1/5th and a last slot just a few records. This probably triggers disk spillover on these slots and slows
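To put the skew described in the original post in numbers, here is a rough back-of-the-envelope sketch. It assumes only the figures quoted in the message (32 slots, 4 slots sharing about 4/5 of the data, 3 slots sharing about 1/5) and ignores the slot that received just a few records. The names `TOTAL_SLOTS`, `GROUPS`, and `per_slot_load` are made up for illustration and are not part of Flink or Beam.

```python
from fractions import Fraction

# Figures taken from the post; everything else here is illustrative.
TOTAL_SLOTS = 32
# (number of slots, combined share of the data held by those slots)
GROUPS = [(4, Fraction(4, 5)), (3, Fraction(1, 5))]


def per_slot_load(groups, total_slots):
    """Return (slots, share per slot, ratio to an even share) per group."""
    even_share = Fraction(1, total_slots)
    result = []
    for slot_count, combined_share in groups:
        share_per_slot = combined_share / slot_count
        result.append((slot_count, share_per_slot, share_per_slot / even_share))
    return result


loads = per_slot_load(GROUPS, TOTAL_SLOTS)
for slot_count, share, ratio in loads:
    print(
        f"{slot_count} slots: {float(share):.1%} of the data each, "
        f"{float(ratio):.1f}x an even share"
    )
```

Under these assumptions each of the four busiest slots holds roughly a fifth of the input, several times what an even spread over 32 slots would give, while most of the remaining slots receive nothing.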