If I’m not mistaken, then this shouldn’t solve the scheduling peculiarity
of Flink. Flink will still deploy the tasks of the flat map operation to
the machine where the source task is running. Only after this machine has
no more slots left, other machines will be used as well.

I think that you don’t need an explicit rebalance() method here. Flink will
automatically insert the PartitionMethod.REBALANCE strategy.

Cheers,
Till
​

On Wed, Feb 24, 2016 at 4:01 PM, Gábor Gévay <gga...@gmail.com> wrote:

> Hello,
>
> > // For each "filename" in list do...
> > DataSet<FeatureList> featureList = fileList
> >                 .flatMap(new ReadDataSetFromFile()) // flatMap because
> there
> > might multiple DataSets in a file
>
> What happens if you just insert .rebalance() before the flatMap?
>
> > This kind of DataSource will only be executed
> > with a degree of parallelism of 1. The source will send it’s collection
> > elements in a round robin fashion to the downstream operators which are
> > executed with a higher parallelism. So when Flink schedules the
> downstream
> > operators, it will try to place them close to their inputs. Since all
> flat
> > map operators have the single data source task as an input, they will be
> > deployed on the same machine if possible.
>
> Sorry, I'm a little confused here. Do you mean that the flatMap will
> have a high parallelism, but all instances on a single machine?
> Because I tried to reproduce the situation where I have a non-parallel
> data source and then a flatMap, and the plan shows that the flatMap
> actually has parallelism 1, which would be an alternative explanation
> to the original problem that it gets executed on a single machine.
> Then, if I insert .rebalance() after the source, then a "Partition"
> operation appears between the source and the flatMap, and the flatMap
> has a high parallelism. I think this should also solve the problem,
> without having to write a parallel data source.
>
> Best,
> Gábor
>

Reply via email to