Hi Tim, unfortunately, this is not documented explicitly as far as I know. For the InputFormats there is a marker interface called NonParallelInput. The input formats which implement this interface will be executed with a parallelism of 1. At the moment this holds true for the CollectionInputFormat, IteratorInputFormat and the JDBCInputFormat.
I hope this helps. Cheers, Till On Tue, Feb 23, 2016 at 3:44 PM, Tim Conrad <con...@math.fu-berlin.de> wrote: > Hi Till (and others). > > Thank you very much for your helpful answer. > > On 23.02.2016 14:20, Till Rohrmann wrote: > > [...] In contrast, if you had a parallel data source which would consist > of multiple source task, then these tasks would be independent and spread > out across your cluster [...] > > > Can you please send me a link to an example or to the respective Flink API > doc, where I can see which is a parallel data source and how to create it > with multiple source tasks? > > A simple Google search did not provide me with an answer (maybe I used the > wrong key words, though...). > > > Cheers > Tim > > > > > On 23.02.2016 14:20, Till Rohrmann wrote: > > Hi Tim, > > depending on how you create the DataSource<String> fileList, Flink will > schedule the downstream operators differently. If you used the > ExecutionEnvironment.fromCollection method, then it will create a > DataSource with a CollectionInputFormat. This kind of DataSource will > only be executed with a degree of parallelism of 1. The source will send > it’s collection elements in a round robin fashion to the downstream > operators which are executed with a higher parallelism. So when Flink > schedules the downstream operators, it will try to place them close to > their inputs. Since all flat map operators have the single data source task > as an input, they will be deployed on the same machine if possible. > > In contrast, if you had a parallel data source which would consist of > multiple source task, then these tasks would be independent and spread out > across your cluster. In this case, every flat map task would have a single > distinct source task as input. When the flat map tasks are deployed they > would be deployed on the machine where their corresponding source is > running. Since the source tasks are spread out across the cluster, the flat > map tasks would be spread out as well. > > What you could do to mitigate your problem is to start the cluster with as > many slots as your maximum degree of parallelism is. That way, you’ll > utilize all cluster resources. > > I hope this clarifies a bit why you observe that tasks tend to cluster on > a single machine. > > Cheers, > Till > > > >