Functions with different parallelism cannot be chained.
Chaining means that Functions are fused into a single operator and that
records are passed by method calls (instead of serializing them into an
in-memory or network channel).
Hence, chaining does not work if you have different parallelism.
Be
someStream.filter(...).map(...).map(...);
there operators are supposed to chained.
but what if there are set different parallelism like below:
someStream.filter(...).setParallelism(X).map(...).setParallelism(Y).map(...).setParallelism(Z);
X != Y != Z
what will happen?
--
Sent from: http://apache
You could try to work around this using a custom Partioner [1].
myData.partitionCustom(new MyPartitioner(),
"myPartitionField").sortPartition("myPartitionField").writeToCsv(...);
In that case, you need to implement the Partition function yourself. To do
that "right" you need to know the value dis
Hi Fabian,
thanks for your reply, my question was exactly about that problem, range
partitioning.
As I have to process a large dataset of values, and to apply a datamining
algorythm on each partition, for me an important point is that the final
result is ordered, to do not lose the sense of data.
Hi Giacomo,
as Max said, you can sort the data within a partition.
However, data across partitions is not sorted. It is either random
partitioned or hash-partitioned (all records that share some property are
in the same partition). Producing fully ordered output, where the first
partition has all
Hi Giacomo,
If you use a FileOutputFormat as a DataSink (e.g. as in
env.writeAsText("/path"), then the output directory will contain 5 files
named 1, 2, 3, 4, and 5, each containing the output of the corresponding
task. The order of the data in the files follows the order of the
distributed DataSe
Hi Max,
thank you for your reply.
DataSink contains data ordered, I mean, it contains in order output1,
output1 ... output5? Or are them mixed?
Thanks a lot,
Giacomo
On Tue, Apr 14, 2015 at 11:58 AM, Maximilian Michels wrote:
> Hi Giacomo,
>
> If I understand you correctly, you want your Flink
Hi Giacomo,
If I understand you correctly, you want your Flink job to execute with a
parallelism of 5. Just call setDegreeOfParallelism(5) on your
ExecutionEnvironment. That way, all operations, when possible, will be
performed using 5 parallel instances. This is also true for the DataSink
which w
Hi guys,
I have a question about how parallelism works.
If I have a large dataset and I would divide it into 5 blocks, can I pass
each block of data to a fixed parallel process (for example I set up 5
process) ?
And if the results data from each process arrive to the output not in an
ordered way,