The Beam model doesn't support global sorting, [1] discusses in detail that you might find useful.
[1] https://lists.apache.org/thread.html/bc0e65a3bb653b8fd0db96bcd4c9da5af71a71af5a5639a472167808@1464278191@%3Cdev.beam.apache.org%3E On 19 July 2017 at 02:45, Jan Lukavský <[email protected]> wrote: > Hi all, > > I'm trying to get better understanding of Beam's internals for the sake of > integration with Euphoria API as a DSL ([1]), and while trying to wrap > Euphoria's abstractions of outputs, I came across a little issue, that I'm > currently a little stuck with. The issue is not important to this question, > but it basically boils down to the following: how could I write a Pipeline, > that works like a terasort benchmark ([2]). That is - I have a randomly > distributed dataset (let's suppose batch case for simplicity), and I want > to sort it so that on output I will have N totally sorted partitions. This > implies that I can somehow compare the partitions (or partition IDs) on > output, so that the following holds: For each partitions X and Y, if > partition X is less to partition Y, then all elements in partition X are > less or equal to all elements in partition Y. > > So far, I have not been able to find a clean solution in Beam. I can do a > group-by-key operation (where the *key* would be partition Id), and then > sort the data within the key. But I have issues outputting the sorted data > by a ParDo (because it can run in parallel in theory, and therefore I can > either loose the sorting, or run to concurrency issues). > > Would anyone have an idea about how to do this? > > Thanks for any comments, > > Jan > > [1] https://github.com/seznam/euphoria > > [2] https://hadoop.apache.org/docs/r2.7.1/api/org/apache/hadoop/ > examples/terasort/package-summary.html > >
