Hi Garrett,
thanks for reporting back!
Glad you could resolve the issue :-)
Best, Fabian
2017-10-05 23:21 GMT+02:00 Garrett Barton:
Fabian,
Turns out I was wrong. My flow was in fact running in two separate jobs
due to me trying to use a local variable calculated by
...distinct().count() in a downstream flow. The second flow indeed set
parallelism correctly! Thank you for the help. :)
On Wed, Oct 4, 2017 at 8:01 AM,
Hi Garrett,
that's strange. DataSet.reduceGroup() will create a non-parallel
GroupReduce operator.
So even without setting the parallelism manually to 1, the operator should
not run in parallel.
What might happen, though, is that a combiner is applied to locally reduce
the data before it is shipped to the single GroupReduce task.
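The combine-then-reduce pattern Fabian describes can be sketched in plain Java (this is only an illustration of the concept, not the Flink API; the partition layout and method names here are made up for the example): each parallel instance pre-aggregates its own partition, and only those partial results reach the single final reducer.

```java
import java.util.List;
import java.util.stream.Collectors;

public class CombinerSketch {
    // Combine step: runs once per local partition, in parallel,
    // so only one partial value per partition is shipped downstream.
    static long combine(List<Long> partition) {
        return partition.stream().mapToLong(Long::longValue).sum();
    }

    // Final reduce: a single, non-parallel merge of the partials.
    static long reduce(List<Long> partials) {
        return partials.stream().mapToLong(Long::longValue).sum();
    }

    public static void main(String[] args) {
        List<List<Long>> partitions =
                List.of(List.of(1L, 2L), List.of(3L, 4L), List.of(5L));
        List<Long> partials = partitions.stream()
                .map(CombinerSketch::combine)
                .collect(Collectors.toList());
        System.out.println(reduce(partials)); // prints 15
    }
}
```

This is why the web UI can show the upstream steps running at full parallelism even though the final GroupReduce itself is non-parallel.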
Gábor,
Thank you for the reply. I gave that a go, and the flow still showed
parallelism 90 for each step. Is the UI not 100% accurate, perhaps?
To get around it for now, I implemented a partitioner that threw all the
data to the same partition. A hack, but it works!
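Garrett's workaround can be sketched as a constant partitioner. The class and method below are a self-contained stand-in mirroring the shape of Flink's `Partitioner<K>` interface (`int partition(K key, int numPartitions)`); they are not the actual Flink types.

```java
public class ForceSinglePartition {
    // Illustrative constant partitioner: every key lands in partition 0,
    // so all records flow through a single downstream task instance --
    // the "hack" described in the thread.
    public static <K> int partition(K key, int numPartitions) {
        return 0;
    }

    public static void main(String[] args) {
        System.out.println(partition("anyKey", 90)); // prints 0
    }
}
```

The cost of this trick is that one task instance receives the entire data set, which is exactly the intended effect here but would be a skew problem anywhere else.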
On Tue, Oct 3, 2017 at 4:12 AM, Gábor
Hi Garrett,
You can call .setParallelism(1) on just this operator:
ds.reduceGroup(new GroupReduceFunction...).setParallelism(1)
Best,
Gabor
On Mon, Oct 2, 2017 at 3:46 PM, Garrett Barton wrote:
I have a complex algorithm implemented using the DataSet API, and by default
it runs with parallelism 90 for good performance. At the end I want to
perform a clustering of the resulting data, and to do that correctly I need
to pass all the data through a single thread/process.
I read in the docs that as