Re: OutOfMemoryException: unable to create native thread

2015-07-17 Thread Stephan Ewen
Right now, I would go with the extra field. The roadmap has pending features that improve the scheduling for plans like yours (with many data sources), but they are not yet in the code. On Fri, Jul 17, 2015 at 11:24 AM, chan fentes chanfen...@gmail.com wrote: I am testing my regex file input

Re: OutOfMemoryException: unable to create native thread

2015-07-17 Thread chan fentes
I am testing my regex file input format, but because I have a workflow that depends on the filename (each filename contains a number that I need), I need to add another field to each of my tuples. What is the best way to avoid this additional field, which I only need for grouping and one
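The "extra field" approach discussed here can be sketched in a few lines. This is only an illustration in plain Python, not Flink code: the helper names (`file_number`, `tag_records`, `group_by_number`) and the naming scheme (the first run of digits in the file name is the number of interest) are assumptions made for the example.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumed naming scheme: the first run of digits in the file name is the
# number the workflow needs, e.g. "part-0042.txt" -> 42.
_NUMBER = re.compile(r"\d+")


def file_number(path):
    """Return the first integer found in the file name (directories are ignored)."""
    match = _NUMBER.search(Path(path).name)
    if match is None:
        raise ValueError(f"no number in file name: {path}")
    return int(match.group())


def tag_records(path, records):
    """Prepend the file number as an extra field to every record of that file."""
    number = file_number(path)
    return [(number,) + tuple(record) for record in records]


def group_by_number(tagged):
    """Group tagged records by the extra field and drop the field again."""
    groups = defaultdict(list)
    for number, *rest in tagged:
        groups[number].append(tuple(rest))
    return dict(groups)
```

The extra field only exists between tagging and grouping, so the cost is one additional value per tuple during that stage.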

Re: OutOfMemoryException: unable to create native thread

2015-07-01 Thread Till Rohrmann
Hi Chan, if you feel up to implementing such an input format, then you can also contribute it. You simply have to open a JIRA issue and take ownership of it. Cheers, Till On Wed, Jul 1, 2015 at 10:08 AM, chan fentes chanfen...@gmail.com wrote: Thank you all for your help and for pointing out

Re: OutOfMemoryException: unable to create native thread

2015-07-01 Thread Stephan Ewen
How about also allowing a varArg of multiple file names for the input format? We'd then have the option of:
- File or directory
- List of files or directories
- Base directory + regex that matches contained file paths
On Wed, Jul 1, 2015 at 10:13 AM, Flavio Pompermaier pomperma...@okkam.it

Re: OutOfMemoryException: unable to create native thread

2015-07-01 Thread chan fentes
Thank you all for your help and for pointing out different possibilities. It would be nice to have an input format that takes a directory and a regex pattern (for file names) to create one data source instead of 1500. This would have helped me to avoid the problem. Maybe this can be included in
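As a rough sketch of the idea of "directory plus regex pattern", the file selection step could look like the following. This is plain Python using only the standard library, not a Flink input format; the function name `matching_files` and the choice to match the pattern against the bare file name (rather than the full path) are assumptions for illustration.

```python
import re
from pathlib import Path


def matching_files(base_dir, pattern):
    """Return the sorted paths of all regular files under base_dir
    whose file name fully matches the regular expression."""
    regex = re.compile(pattern)
    return sorted(
        path
        for path in Path(base_dir).rglob("*")  # walks subdirectories as well
        if path.is_file() and regex.fullmatch(path.name)
    )
```

The resulting list can then be handed to a single reader, so the job sees one logical input with many files instead of one source per file.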

Re: OutOfMemoryException: unable to create native thread

2015-06-30 Thread Ufuk Celebi
Hey Chan, the problem is that all sources are scheduled at once in pipelined execution mode (the default). There is work in progress to support your workload better in batch execution mode, e.g. run each source one after the other and materialize intermediate results. This will hopefully be in

Re: OutOfMemoryException: unable to create native thread

2015-06-30 Thread Stephan Ewen
I agree with Aljoscha and Ufuk. As said, it will be hard for the system (currently) to handle 1500 sources, but handling a parallel source with 1500 files will be very efficient. This is possible if all sources (files) deliver the same data type and would be unioned. If that is true, you can
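The difference between many sources and one parallel source over many files can be illustrated outside of Flink. The sketch below is plain Python and makes no claim about how Flink schedules work internally: it only shows that a small, fixed number of worker threads can read an arbitrary number of files and union their records, whereas one thread per file would need as many threads as there are files. The names `read_lines` and `read_union` and the default parallelism of 4 are assumptions for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def read_lines(path):
    """Read one file and return its lines without trailing newlines."""
    return Path(path).read_text(encoding="utf-8").splitlines()


def read_union(paths, parallelism=4):
    """Read all files with at most `parallelism` threads and union the records."""
    records = []
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        # map() yields results in input order; each worker thread
        # processes many files one after the other.
        for lines in pool.map(read_lines, paths):
            records.extend(lines)
    return records
```

This only works as a union because every file yields records of the same type, which is the precondition named above.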