Re: HDFS data locality and distribution, Flink

2018-03-19 Thread Reinier Kip
was harder to match to each file in one big PCollection.

Reinier

From: Aljoscha Krettek <aljos...@apache.org>
Sent: 13 March 2018 18:29:52
To: user@beam.apache.org
Subject: Re: HDFS data locality and distribution, Flink

Hi, There should be no data-locality awa

Re: HDFS data locality and distribution, Flink

2018-03-13 Thread Aljoscha Krettek
Mar 2018, at 05:50, Reinier Kip <r...@bol.com> wrote:
>
> Relevant versions: Beam 2.1, Flink 1.3.
> From: Reinier Kip <r...@bol.com>
> Sent: 12 March 2018 13:46:24
> To: user@beam.apache.org
> Subject: HDFS data locality and distribution, Flink
>
> Hey all,

Re: HDFS data locality and distribution, Flink

2018-03-12 Thread Reinier Kip
Relevant versions: Beam 2.1, Flink 1.3.

From: Reinier Kip <r...@bol.com>
Sent: 12 March 2018 13:46:24
To: user@beam.apache.org
Subject: HDFS data locality and distribution, Flink

Hey all, I'm trying to batch-process 30-ish files from HDFS, but I see tha

HDFS data locality and distribution, Flink

2018-03-12 Thread Reinier Kip
Hey all, I'm trying to batch-process 30-ish files from HDFS, but I see that data is distributed very badly across slots. 4 out of 32 slots get 4/5ths of the data, another 3 slots get about 1/5th and a last slot just a few records. This probably triggers disk spillover on these slots and slows
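
To put the reported skew in numbers, here is a rough back-of-the-envelope sketch in Python. It uses only the figures quoted in the message (32 slots, 4 slots holding about 4/5ths of the data, 3 slots holding about 1/5th). The even split inside each group and the names `groups` and `per_slot_skew` are assumptions made for illustration; they do not come from the thread.

```python
from fractions import Fraction

TOTAL_SLOTS = 32

# (number of slots, share of all data that group receives), as reported in the post.
groups = [(4, Fraction(4, 5)), (3, Fraction(1, 5))]

# Share each slot would hold if the data were spread evenly over all slots.
ideal = Fraction(1, TOTAL_SLOTS)


def per_slot_skew(slot_count, group_share):
    """Ratio of one slot's share to the even share.

    Assumption: data is split evenly among the slots inside a group.
    """
    return (group_share / slot_count) / ideal


skews = [per_slot_skew(n, share) for n, share in groups]
busy_slots = sum(n for n, _ in groups)

for (n, share), skew in zip(groups, skews):
    print(f"{n} slots: {float(share / n):.1%} of the data each, "
          f"{float(skew):.1f}x the even share")
print(f"{busy_slots} of {TOTAL_SLOTS} slots hold nearly all of the data")
```

Under these assumptions the four busiest slots each carry several times what an even spread would give them, which fits the suspicion of disk spillover on those slots.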