Re: OutOfMemoryException: unable to create native thread

2015-07-17 Thread Stephan Ewen
Right now, I would go with the extra field.

The roadmap has pending features that improve scheduling for plans like
yours (with many data sources), but they are not yet in the code.

On Fri, Jul 17, 2015 at 11:24 AM, chan fentes chanfen...@gmail.com wrote:

 I am testing my regex file input format, but because I have a workflow
 that depends on the filename (each filename contains a number that I need),
 I need to add another field to each of my tuples. What is the best way to
 avoid this additional field, which I only need for grouping and one
 multiplication (in a MapFunction) late in my workflow? An easy way would be
 to do the multiplication in the input format; however, I also need the value
 for grouping.
 If I were able to use many data sources (one for each file), I could avoid
 the additional field (no grouping per file required) and possibly decrease
 the runtime of the plan(s).

 Thanks in advance for your help.

 2015-07-01 10:20 GMT+02:00 Stephan Ewen se...@apache.org:

 How about also allowing a varargs parameter with multiple file names for the
 input format?

 We'd then have the option of

  - File or directory
  - List of files or directories
  - Base directory + regex that matches contained file paths



 On Wed, Jul 1, 2015 at 10:13 AM, Flavio Pompermaier pomperma...@okkam.it
  wrote:

 +1 :)

 On Wed, Jul 1, 2015 at 10:08 AM, chan fentes chanfen...@gmail.com
 wrote:

 Thank you all for your help and for pointing out different
 possibilities.
 It would be nice to have an input format that takes a directory and a
 regex pattern (for file names) to create one data source instead of 1500.
 This would have helped me to avoid the problem. Maybe this can be included
 in one of the future releases. ;)

 2015-06-30 19:02 GMT+02:00 Stephan Ewen se...@apache.org:

 I agree with Aljoscha and Ufuk.

 As said, it will currently be hard for the system to handle 1500
 sources, but handling a single parallel source with 1500 files will be
 very efficient. This is possible if all sources (files) deliver the same
 data type and can be unioned.

 If that is true, you can

  - Specify the input as a directory.

  - If you cannot do that, because there is no common parent directory,
 you can union the files into one data source with a simple trick, as
 described here:
 http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/open-multiple-file-from-list-of-uri-tp1804p1807.html



 On Tue, Jun 30, 2015 at 5:36 PM, Aljoscha Krettek aljos...@apache.org
  wrote:

 Hi Chan,
 Flink sources support giving a directory as an input path. If you do
 this, the source will read each of the files in that directory. The way you
 do it leads to a very big plan, because the plan is replicated 1500 times,
 which could lead to the OutOfMemoryException.

 Is there a specific reason why you create 1500 separate sources?

 Regards,
 Aljoscha

 On Tue, 30 Jun 2015 at 17:17 chan fentes chanfen...@gmail.com
 wrote:

 Hello,

 How many data sources can I use in one Flink plan? Is there any
 limit? I get a
 java.lang.OutOfMemoryException: unable to create native thread
 when I have approx. 1500 files. What I basically do is the following:
 DataSource -> Map -> Map -> GroupBy -> GroupReduce per file
 and then
 Union -> GroupBy -> Sum in a tree-like reduction.

 I have checked the workflow. It runs on a cluster without any
 problem if I only use a few files. Does Flink use a thread per operator? It
 seems as if I am limited in the number of threads I can use. How can I
 avoid the exception mentioned above?

 Best regards
 Chan
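The per-file plan described above can be sketched outside Flink. The following is plain Python standing in for the DataSet API, with invented record shapes, map functions, and file names; it only illustrates the shape of the dataflow, not Flink itself.

```python
# Illustrative sketch of "DataSource -> Map -> Map -> GroupBy -> GroupReduce"
# per file, followed by "Union -> GroupBy -> Sum" across files.
from collections import defaultdict

def group_reduce(records):
    """GroupBy key, then reduce each group; here the reduce is a sum."""
    groups = defaultdict(int)
    for key, value in records:
        groups[key] += value
    return list(groups.items())

# One tiny "file" per data source; the real job had ~1500 of these.
files = {
    "file_1": [("a", 1), ("b", 2), ("a", 3)],
    "file_2": [("a", 10), ("b", 20)],
}

# Per-file branch: Map -> Map -> GroupBy -> GroupReduce.
per_file = []
for records in files.values():
    mapped = [(k, v * 2) for k, v in records]      # first Map
    mapped = [(k.upper(), v) for k, v in mapped]   # second Map
    per_file.append(group_reduce(mapped))          # GroupBy + GroupReduce

# Union all branches, then GroupBy -> Sum across files.
unioned = [rec for branch in per_file for rec in branch]
totals = dict(group_reduce(unioned))
print(totals)
```

With ~1500 branches like this, the replicated plan is what exhausts the native threads; the rest of the thread discusses collapsing the branches into one source.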










Re: OutOfMemoryException: unable to create native thread

2015-07-17 Thread chan fentes
I am testing my regex file input format, but because I have a workflow that
depends on the filename (each filename contains a number that I need), I
need to add another field to each of my tuples. What is the best way to
avoid this additional field, which I only need for grouping and one
multiplication (in a MapFunction) late in my workflow? An easy way would be
to do the multiplication in the input format; however, I also need the value
for grouping.
If I were able to use many data sources (one for each file), I could avoid
the additional field (no grouping per file required) and possibly decrease
the runtime of the plan(s).

Thanks in advance for your help.

Re: OutOfMemoryException: unable to create native thread

2015-07-01 Thread Till Rohrmann
Hi Chan,

if you feel up to implementing such an input format, then you can also
contribute it. You simply have to open a JIRA issue and take ownership of
it.

Cheers,
Till

Re: OutOfMemoryException: unable to create native thread

2015-07-01 Thread Stephan Ewen
How about also allowing a varargs parameter with multiple file names for the
input format?

We'd then have the option of

 - File or directory
 - List of files or directories
 - Base directory + regex that matches contained file paths



Re: OutOfMemoryException: unable to create native thread

2015-07-01 Thread chan fentes
Thank you all for your help and for pointing out different possibilities.
It would be nice to have an input format that takes a directory and a regex
pattern (for file names) to create one data source instead of 1500. This
would have helped me to avoid the problem. Maybe this can be included in
one of the future releases. ;)
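The "directory + regex pattern" input format requested here could select files roughly as sketched below. This is a hedged, stdlib-only Python illustration of the selection step, not Flink code; the file names and pattern are invented, and the number extracted from each name stands in for the number Chan needs from his filenames.

```python
# Sketch: list a base directory and keep only file names matching a regex,
# pulling the embedded number out of each matching name.
import os
import re
import tempfile

base = tempfile.mkdtemp()
for name in ["part-007.csv", "part-123.csv", "notes.txt"]:
    open(os.path.join(base, name), "w").close()

pattern = re.compile(r"part-(\d+)\.csv$")

# Keep matching files and decode the filename number that the original
# workflow otherwise had to carry as an extra tuple field.
selected = sorted(
    (name, int(pattern.match(name).group(1)))
    for name in os.listdir(base)
    if pattern.match(name)
)
print(selected)
```

An input format built this way would produce one data source over all matching files instead of 1500 separate sources.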

Re: OutOfMemoryException: unable to create native thread

2015-06-30 Thread Ufuk Celebi
Hey Chan,

the problem is that all sources are scheduled at once in the (default) pipelined
execution mode. There is work in progress to support your workload better in
batch execution mode, e.g. running each source one after the other and
materializing intermediate results. This will hopefully be in the upcoming 0.10 release.

At the moment the best thing you can do is as Aljoscha suggested. Does this 
work for you or does each file need different processing?

– Ufuk



Re: OutOfMemoryException: unable to create native thread

2015-06-30 Thread Stephan Ewen
I agree with Aljoscha and Ufuk.

As said, it will currently be hard for the system to handle 1500 sources,
but handling a single parallel source with 1500 files will be very efficient.
This is possible if all sources (files) deliver the same data type and can
be unioned.

If that is true, you can

 - Specify the input as a directory.

 - If you cannot do that, because there is no common parent directory, you
can union the files into one data source with a simple trick, as
described here:
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/open-multiple-file-from-list-of-uri-tp1804p1807.html
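The union trick linked above builds one logical data source out of many files. Outside Flink, the same idea is chaining per-file record streams into a single stream; the sketch below is plain Python with invented file names and contents, not the Flink API described in the link.

```python
# "Union" of several file sources into one logical source: chain the
# per-file record iterators into a single iterator.
import itertools
import os
import tempfile

d = tempfile.mkdtemp()
paths = []
for i, text in enumerate(["1\n2\n", "3\n"]):
    p = os.path.join(d, f"in-{i}.txt")
    with open(p, "w") as f:
        f.write(text)
    paths.append(p)

def read_lines(path):
    """Per-file source: yield one integer record per line."""
    with open(path) as f:
        for line in f:
            yield int(line)

# One logical source built from many files.
union = itertools.chain.from_iterable(read_lines(p) for p in paths)
total = sum(union)
print(total)  # 6
```

Downstream operators then see a single source, so the plan is not replicated once per file.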


