Re: At end of complex parallel flow, how to force end step with parallel=1?

2017-10-06 Thread Fabian Hueske
Hi Garrett, thanks for reporting back! Glad you could resolve the issue :-) Best, Fabian 2017-10-05 23:21 GMT+02:00 Garrett Barton : > Fabian, > > Turns out I was wrong. My flow was in fact running in two separate jobs > due to me trying to use a local variable

Re: At end of complex parallel flow, how to force end step with parallel=1?

2017-10-05 Thread Garrett Barton
Fabian, Turns out I was wrong. My flow was in fact running in two separate jobs due to me trying to use a local variable calculated by ...distinct().count() in a downstream flow. The second flow indeed set parallelism correctly! Thank you for the help. :) On Wed, Oct 4, 2017 at 8:01 AM,

Re: At end of complex parallel flow, how to force end step with parallel=1?

2017-10-04 Thread Fabian Hueske
Hi Garrett, that's strange. DataSet.reduceGroup() will create a non-parallel GroupReduce operator. So even without setting the parallelism manually to 1, the operator should not run in parallel. What might happen though is that a combiner is applied to locally reduce the data before it is shipped

Re: At end of complex parallel flow, how to force end step with parallel=1?

2017-10-03 Thread Garrett Barton
Gábor ​, Thank you for the reply, I gave that a go and the flow still showed parallel 90 for each step. Is the ui not 100% accurate perhaps? To get around it for now I implemented a partitioner that threw all the data to the same partition, hack but works!​ On Tue, Oct 3, 2017 at 4:12 AM, Gábor

Re: At end of complex parallel flow, how to force end step with parallel=1?

2017-10-03 Thread Gábor Gévay
Hi Garrett, You can call .setParallelism(1) on just this operator: ds.reduceGroup(new GroupReduceFunction...).setParallelism(1) Best, Gabor On Mon, Oct 2, 2017 at 3:46 PM, Garrett Barton wrote: > I have a complex alg implemented using the DataSet api and by default

At end of complex parallel flow, how to force end step with parallel=1?

2017-10-02 Thread Garrett Barton
I have a complex alg implemented using the DataSet api and by default it runs with parallel 90 for good performance. At the end I want to perform a clustering of the resulting data and to do that correctly I need to pass all the data through a single thread/process. I read in the docs that as