Hi,
thanks for answer. I understand that Beam does not want to incorporate
in the model a way to handle parallelism (because it is left to the
runner to decide, which I find good). But there are some use-cases where
it would be beneficial to force *sequential* processing. That is to make
sure that certain PCollection (or, to state it exactly, each window of a
PCollection) is processed entirely by a single (fault tolerant)
instance. The terasort pipeline would then be realizable and I don't
think that even affects the runners so much. Many of them (actuall all I
know :)) nevertheless have this option to process a "partition" by a
single "mapper" or "processor".
Would it be possible to add a sequential form of ParDo into the model?
Or is it strictly against the philosophy?
Jan
On 07/19/2017 10:48 PM, Vikas RK wrote:
The Beam model doesn't support global sorting, [1] discusses in detail
that you might find useful.
[1]
https://lists.apache.org/thread.html/bc0e65a3bb653b8fd0db96bcd4c9da5af71a71af5a5639a472167808@1464278191@%3Cdev.beam.apache.org%3E
On 19 July 2017 at 02:45, Jan Lukavský <[email protected]
<mailto:[email protected]>> wrote:
Hi all,
I'm trying to get better understanding of Beam's internals for the
sake of integration with Euphoria API as a DSL ([1]), and while
trying to wrap Euphoria's abstractions of outputs, I came across a
little issue, that I'm currently a little stuck with. The issue is
not important to this question, but it basically boils down to the
following: how could I write a Pipeline, that works like a
terasort benchmark ([2]). That is - I have a randomly distributed
dataset (let's suppose batch case for simplicity), and I want to
sort it so that on output I will have N totally sorted partitions.
This implies that I can somehow compare the partitions (or
partition IDs) on output, so that the following holds: For each
partitions X and Y, if partition X is less to partition Y, then
all elements in partition X are less or equal to all elements in
partition Y.
So far, I have not been able to find a clean solution in Beam. I
can do a group-by-key operation (where the *key* would be
partition Id), and then sort the data within the key. But I have
issues outputting the sorted data by a ParDo (because it can run
in parallel in theory, and therefore I can either loose the
sorting, or run to concurrency issues).
Would anyone have an idea about how to do this?
Thanks for any comments,
Jan
[1] https://github.com/seznam/euphoria
<https://github.com/seznam/euphoria>
[2]
https://hadoop.apache.org/docs/r2.7.1/api/org/apache/hadoop/examples/terasort/package-summary.html
<https://hadoop.apache.org/docs/r2.7.1/api/org/apache/hadoop/examples/terasort/package-summary.html>