Hi,

Flink InputFormats generate their InputSplits sequentially on the
JobManager.
These splits are stored in the heap of the JM process and lazily handed
out to SourceTasks when they request them.
Split assignment is done by an InputSplitAssigner, which can be customized.
FileInputFormats typically use a LocatableInputSplitAssigner, which tries
to assign splits based on data locality.
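
As a minimal, hypothetical sketch of the user-facing side (the path and the
job logic are just placeholders), reading a large directory tree with a
FileInputFormat looks roughly like this:

import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.io.TextInputFormat;
import org.apache.flink.core.fs.Path;

public class ManyFilesJob {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Placeholder path: one split per file (or per block of a large
        // file) is created on the JobManager before any record is read.
        TextInputFormat format =
            new TextInputFormat(new Path("hdfs:///data/many-files"));
        format.setNestedFileEnumeration(true); // also recurse into sub-dirs

        env.createInput(format, BasicTypeInfo.STRING_TYPE_INFO)
           .first(10)
           .print();
    }
}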

I see three potential problems:
1) InputSplit generation might take a long time. The JM is blocked until
all splits have been generated.
2) All InputSplits need to be kept on the JM heap. With millions of splits
you might need to assign more memory to the JM process.
3) Split assignment might take a while, depending on the complexity of the
InputSplitAssigner. You can implement a custom assigner to make this more
efficient from an assignment point of view (see the sketch below).
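
To make 3) concrete, here is a minimal, hypothetical sketch of such a custom
assigner (assuming an InputSplitAssigner interface with a single
getNextInputSplit(String host, int taskId) method; the exact interface can
differ between Flink versions). It hands out splits in FIFO order and skips
all locality matching, which is roughly what the built-in
DefaultInputSplitAssigner does:

import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Queue;

import org.apache.flink.core.io.InputSplit;
import org.apache.flink.core.io.InputSplitAssigner;

// Hypothetical example, not part of Flink: hands out splits in FIFO order
// without any locality matching.
public class FifoInputSplitAssigner implements InputSplitAssigner {

    private final Queue<InputSplit> remainingSplits;

    public FifoInputSplitAssigner(InputSplit[] splits) {
        this.remainingSplits = new ArrayDeque<>(Arrays.asList(splits));
    }

    @Override
    public synchronized InputSplit getNextInputSplit(String host, int taskId) {
        // Called on the JM whenever a source task asks for more work.
        // Returning null signals that all splits have been assigned.
        return remainingSplits.poll();
    }
}

You would return such an assigner from your InputFormat's
getInputSplitAssigner(...) method. The LocatableInputSplitAssigner trades
assignment cost for locality; the FIFO variant above does the opposite,
which can matter when there are millions of splits.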

Best, Fabian

2018-08-14 8:19 GMT+02:00 Jörn Franke <jornfra...@gmail.com>:

> It causes more overhead (processes, etc.), which might make it slower.
> Furthermore, if you have the files stored on HDFS, then the bottleneck is
> the NameNode, which will have to answer millions of requests.
> The latter point will change in future Hadoop versions with
> http://ozone.hadoop.apache.org/
>
> On 13. Aug 2018, at 21:01, Darshan Singh <darshan.m...@gmail.com> wrote:
>
> Hi Guys,
>
> Is there a limit on the number of files a Flink dataset can read? My
> question is: will there be any sort of issue if I have, say, millions of
> files to read to create a single dataset.
>
> Thanks
>
>
