Hi Philip,
Thank you for your questions. I think you have mapped the HIVE
functions to the Flink ones correctly. Just a remark on the ORDER BY.
You wrote that it produces a total order of all the records. In this
case, you'd have do a SortPartition operation with parallelism set to
1. This is
Hi Philip,
You're welcome. Just a small correction: Hive's SORT BY should be
DataSet.groupBy(key).sortGroup(key) in Flink. This ensures sorted
grouped records within the reducer that follows. No need to set the
parallelism to 1.
Best,
Max
On Mon, Oct 19, 2015 at 1:28 PM, Philip Lee
Thanks, Fabian.
I just want to check one thing again.
As you said, [Distribute By] is partitionByHash(). and [Sort By] should be
sortGroup on Flink. However, [Cluster By] is consist of partitionByHash().
*sortPartition()*.
As far as I know, [Cluster By] is same as the combination with
Hi Philip,
here a few additions to what Max said:
- ORDER BY: As Max said, Flink's sortPartition() does only sort with a
partition and does not produce a total order. You can either set the
parallelism to 1 as Max suggested or use a custom partitioner to range
partition the data.
- SORT BY: From
The difference between a partition and a group is the following:
- A partition refers to all records that are processed by a task instance
or sub task. If you use hash partitioning, all elements that share the same
key will be in one partition, but usually there will be more than one key
in a