Re: DataExchangeMode.BATCH in iterations

2016-02-05 Thread Stephan Ewen
Hi! Flink has a non-batch exchange way to break pipelines, which is by now quite custom for iterations. It is used there for constructs that fork and re-join the flow. The proper batch-exchange is better, because the scheduler can exploit that, but is is not yet usable in iterations. Stephan

Re: DataExchangeMode.BATCH in iterations

2016-02-01 Thread Fridtjof Sander
Hi Fabian, thanks for your explanation! Yeah, I figured that if an easy fix exists, you would have done that yourself. This is more for me to understand the conceptual problem. But back to the pipeline-requirement: Doesn't zipWithIndex violate that too, then? It's also a mapPartitions, collec

Re: DataExchangeMode.BATCH in iterations

2016-02-01 Thread Fabian Hueske
Hi Fridtjof, the range partitioner works by building a histogram for the partitioning key. This requires a pass over the whole intermediate data set which means it needs to be materialized and cannot be processed in a pipelined fashion. However, pipelined data exchange strategies are a requirement