I agree with Aljoscha and Ufuk.

As mentioned, it will be hard for the system (currently) to handle 1500 separate
sources, but handling one parallel source that reads 1500 files will be very
efficient. This is possible if all sources (files) deliver the same data type
and would be unioned anyway.

If that is true, you can

 - Specify the input as a directory.

 - If you cannot do that, because there is no common parent directory, you
can "union" the files into one data source with a simple trick, as
described here:
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/open-multiple-file-from-list-of-uri-tp1804p1807.html
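The union trick from that thread can be sketched roughly as follows. This is a minimal sketch against Flink's DataSet API; the file paths are placeholders, and in practice the list would come from wherever you enumerate your inputs:

```java
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

import java.util.Arrays;
import java.util.List;

public class UnionFiles {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical input paths -- replace with however you list your files.
        List<String> paths = Arrays.asList(
                "hdfs:///data/part-1.txt",
                "hdfs:///data/part-2.txt");

        // Fold all files into one logical data source, so the plan contains
        // a single (unioned) source instead of one source per file.
        DataSet<String> all = null;
        for (String path : paths) {
            DataSet<String> lines = env.readTextFile(path);
            all = (all == null) ? lines : all.union(lines);
        }

        // ... continue with your Map / GroupBy / Reduce on `all` ...
        all.print();
    }
}
```

This keeps the plan small: the union is a cheap plan-level operation, whereas 1500 independent sources replicate the downstream operators 1500 times.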



On Tue, Jun 30, 2015 at 5:36 PM, Aljoscha Krettek <aljos...@apache.org>
wrote:

> Hi Chan,
> Flink sources support giving a directory as the input path of a source. If
> you do this, it will read each of the files in that directory. The way you
> do it leads to a very big plan, because the plan will be replicated 1500
> times; this could lead to the OutOfMemoryError.
>
> Is there a specific reason why you create 1500 separate sources?
>
> Regards,
> Aljoscha
>
> On Tue, 30 Jun 2015 at 17:17 chan fentes <chanfen...@gmail.com> wrote:
>
>> Hello,
>>
>> how many data sources can I use in one Flink plan? Is there any limit? I
>> get a
>> java.lang.OutOfMemoryError: unable to create new native thread
>> when I have approx. 1500 files. What I basically do is the following:
>> DataSource -> Map -> Map -> GroupBy -> GroupReduce per file
>> and then
>> Union -> GroupBy -> Sum in a tree-like reduction.
>>
>> I have checked the workflow. It runs on a cluster without any problem if
>> I only use a few files. Does Flink use a thread per operator? It seems as
>> if I am limited in the number of threads I can use. How can I avoid the
>> exception mentioned above?
>>
>> Best regards
>> Chan
>>
>
