Hey Chan,

The problem is that, in pipelined execution mode (the default), all sources are
scheduled at once. There is work in progress to support your workload better in
batch execution mode, e.g. by running the sources one after the other and
materializing intermediate results. This will hopefully make it into the
upcoming 0.10 release.

At the moment, the best thing you can do is what Aljoscha suggested. Does this
work for you, or does each file need different processing?
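
For reference, pointing a single source at the directory looks roughly like
this (a minimal sketch against the Java DataSet API; the path is illustrative):

    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;

    public class DirectorySourceSketch {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            // One source for the whole directory: Flink reads every file in
            // it, so the plan contains a single source instead of ~1500.
            DataSet<String> lines = env.readTextFile("hdfs:///path/to/input-dir");

            lines.print();
        }
    }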

– Ufuk

On 30 Jun 2015, at 17:36, Aljoscha Krettek <aljos...@apache.org> wrote:

> Hi Chan,
> Flink sources support giving a directory as the input path. If you do this, the
> source will read each of the files in that directory. The way you are doing it
> leads to a very big plan, because the plan is replicated 1500 times, which
> could cause the OutOfMemoryError.
> 
> Is there a specific reason why you create 1500 separate sources?
> 
> Regards,
> Aljoscha
> 
> On Tue, 30 Jun 2015 at 17:17 chan fentes <chanfen...@gmail.com> wrote:
> Hello,
> 
> How many data sources can I use in one Flink plan? Is there any limit? I get a
> java.lang.OutOfMemoryError: unable to create new native thread
> when I have approx. 1500 files. What I basically do is the following:
> DataSource -> Map -> Map -> GroupBy -> GroupReduce per file
> and then
> Union -> GroupBy -> Sum in a tree-like reduction.
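> 
> In code, the pattern is roughly the following (a simplified sketch; the map
> and aggregation bodies are stand-ins for my real logic):
> 
>     import org.apache.flink.api.common.functions.MapFunction;
>     import org.apache.flink.api.java.DataSet;
>     import org.apache.flink.api.java.ExecutionEnvironment;
>     import org.apache.flink.api.java.tuple.Tuple2;
> 
>     public class PerFileSourcesSketch {
>         public static void main(String[] args) throws Exception {
>             ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
> 
>             // One source (and one sub-plan) per file, ~1500 times in total.
>             DataSet<Tuple2<String, Integer>> union = null;
>             for (String path : args) {
>                 DataSet<Tuple2<String, Integer>> perFile = env.readTextFile(path)
>                     .map(new MapFunction<String, Tuple2<String, Integer>>() {
>                         @Override
>                         public Tuple2<String, Integer> map(String line) {
>                             // stand-in for the real Map -> Map parsing steps
>                             return new Tuple2<String, Integer>(line, 1);
>                         }
>                     })
>                     .groupBy(0)
>                     .sum(1); // stand-in for the per-file GroupReduce
>                 union = (union == null) ? perFile : union.union(perFile);
>             }
> 
>             // Final global aggregation over the union of all partial results.
>             union.groupBy(0).sum(1).print();
>         }
>     }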
> 
> I have checked the workflow. It runs on a cluster without any problem if I
> only use a few files. Does Flink use a thread per operator? It seems as if I
> am limited in the number of threads I can use. How can I avoid the error
> mentioned above?
> 
> Best regards
> Chan
