The Beam model doesn't support global sorting. The thread at [1] discusses
this in detail; you might find it useful.

[1]
https://lists.apache.org/thread.html/bc0e65a3bb653b8fd0db96bcd4c9da5af71a71af5a5639a472167808@1464278191@%3Cdev.beam.apache.org%3E

On 19 July 2017 at 02:45, Jan Lukavský <[email protected]> wrote:

> Hi all,
>
> I'm trying to get a better understanding of Beam's internals for the sake of
> integration with the Euphoria API as a DSL ([1]). While trying to wrap
> Euphoria's abstractions of outputs, I came across a small issue that I'm
> currently stuck on. The issue itself is not important to this question,
> but it basically boils down to the following: how could I write a Pipeline
> that works like a terasort benchmark ([2])? That is, I have a randomly
> distributed dataset (let's suppose batch case for simplicity), and I want
> to sort it so that on output I will have N totally sorted partitions. This
> implies that I can somehow compare the partitions (or partition IDs) on
> output, so that the following holds: for any two partitions X and Y, if
> partition X is less than partition Y, then all elements in partition X are
> less than or equal to all elements in partition Y.
>
> So far, I have not been able to find a clean solution in Beam. I can do a
> group-by-key operation (where the *key* would be partition Id), and then
> sort the data within the key. But I have issues outputting the sorted data
> by a ParDo (because it can run in parallel in theory, and therefore I can
> either lose the sorting or run into concurrency issues).
>
> Would anyone have an idea about how to do this?
>
> Thanks for any comments,
>
>  Jan
>
> [1] https://github.com/seznam/euphoria
>
> [2] https://hadoop.apache.org/docs/r2.7.1/api/org/apache/hadoop/examples/terasort/package-summary.html
>
>
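
For illustration, the approach described in the quoted message (use a
partition ID as the key, then sort within each key) can be sketched in plain
Python. This is only a minimal sketch of terasort-style range partitioning
and uses no Beam API. The function names, the sampling step used to pick the
partition boundaries, and the sample_size and seed parameters are all
assumptions made for this example, not something taken from the thread.

```python
import bisect
import random
from collections import defaultdict


def choose_split_points(sample, num_partitions):
    """Pick num_partitions - 1 boundaries from a sample of the data.

    Hypothetical helper: the boundaries are evenly spaced positions
    in the sorted sample.
    """
    ordered = sorted(sample)
    if not ordered:
        return []
    step = len(ordered) / num_partitions
    return [ordered[int(i * step)] for i in range(1, num_partitions)]


def range_partition_sort(records, num_partitions, sample_size=100, seed=0):
    """Return num_partitions sorted lists, ordered by partition ID.

    Every element of partition i is <= every element of partition j
    whenever i < j.
    """
    rng = random.Random(seed)
    sample = rng.sample(records, min(sample_size, len(records)))
    splits = choose_split_points(sample, num_partitions)

    # "Group by key", where the key is the partition ID derived from
    # the position of the record among the split points.
    groups = defaultdict(list)
    for record in records:
        partition_id = bisect.bisect_right(splits, record)
        groups[partition_id].append(record)

    # Sort within each key; partitions may be empty if the sample was skewed.
    return [sorted(groups[pid]) for pid in range(num_partitions)]
```

Concatenating the returned partitions in ID order gives the globally sorted
sequence, which is the property asked for in the question. The sketch does not
address how a particular runner would write each partition out in order.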
