Hi,

I have seen cases like that, and my general conclusion is that Beam (or Flink for that matter) is not the best match for solving this. Let me explain:

 a) Beam (and Flink) tries hard to deliver exactly-once semantics for the data processing, and this inevitably introduces some latency

 b) each pipeline is essentially a set of operators (workers) orchestrated to deliver a defined computation, but checkpointed independently

 c) the checkpoint barrier starts from the sources and then propagates to the sinks

The property c) is what stands in your way the most. To implement the pattern you described efficiently, I believe it is essential to be able to invert the flow of the checkpoint barrier so that it goes from sinks to sources - i.e., when a sink (your slow worker) finishes a piece of work, it can notify the parent operator (the source, in your case) to perform a checkpoint, and this checkpoint can be a commit to Kafka. This subtle difference is the key that lets the pipeline buffer any realtime updates to the priority of any row (by caching all pending work in memory) and deliver rows to the worker one by one, ordered by priority as of the time the worker asks for more work. Moreover, it enables the system to "reject" rows whose priority is too low when the buffer would otherwise overflow (which will not happen in your case if you have <10k rows, I'd say).

So, I think that a simple KafkaConsumer would be more appropriate for what you described; a minimal sketch follows below. It would be interesting to discuss what would be needed on the Beam side to support cases like that (essentially a microservice). A different runner ([1]) than Flink is my first guess, but I have a feeling that some changes would be needed in the model as well.
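
To make this concrete, here is a minimal sketch of the consumer loop I have in mind - the topic name, the Row record, and process() are placeholders, and the offset handling is deliberately simplified:

import java.time.Duration;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class PriorityWorker {
  // Hypothetical row type: an id plus its current priority.
  record Row(String id, int priority) {}

  static void process(Row row) { /* the slow worker */ }

  public static void main(String[] args) {
    Properties props = new Properties(); // bootstrap.servers, group.id, deserializers, ...
    try (KafkaConsumer<String, Row> consumer = new KafkaConsumer<>(props)) {
      consumer.subscribe(Collections.singletonList("work"));
      // All pending work is cached in memory; a later record with the
      // same key replaces the row, which is how a priority update (or
      // a delete, via a null value) takes effect.
      Map<String, Row> pending = new HashMap<>();
      while (true) {
        for (ConsumerRecord<String, Row> rec : consumer.poll(Duration.ofMillis(100))) {
          if (rec.value() == null) {
            pending.remove(rec.key());
          } else {
            pending.put(rec.key(), rec.value());
          }
        }
        // Hand the highest-priority row to the worker, then commit -
        // the "checkpoint" happens only after the work is done. Note
        // that commitSync() acknowledges everything polled so far,
        // including rows still pending in memory, so a real version
        // needs idempotent processing or per-offset commits.
        pending.values().stream()
            .max(Comparator.comparingInt(Row::priority))
            .ifPresent(row -> {
              process(row);
              pending.remove(row.id());
              consumer.commitSync();
            });
      }
    }
  }
}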

Jan

[1] https://issues.apache.org/jira/browse/BEAM-2466

On 8/24/19 11:10 AM, Juan Carlos Garcia wrote:
Sorry, I hit send too soon. After updating the priority, I would apply a GroupByKey and then use an external sort (by this priority key) to guarantee that the next DoFn receives a sorted Iterable; see the sketch below.
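
A rough sketch of that, using Beam's sorter extension (the beam-sdks-java-extensions-sorter module); the Row type and the keying step are placeholders:

import org.apache.beam.sdk.extensions.sorter.BufferedExternalSorter;
import org.apache.beam.sdk.extensions.sorter.SortValues;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

// Input: rows keyed for grouping, with the (possibly updated)
// priority as the secondary key.
static PCollection<KV<String, Iterable<KV<Integer, Row>>>> sortByPriority(
    PCollection<KV<String, KV<Integer, Row>>> rows) {
  return rows
      .apply(GroupByKey.<String, KV<Integer, Row>>create())
      // Sorts each group by the encoded secondary key (the priority),
      // spilling to disk when it does not fit in memory, so the next
      // DoFn receives the values as a sorted Iterable.
      .apply(SortValues.create(BufferedExternalSorter.options()));
}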


JC

Juan Carlos Garcia <jcgarc...@gmail.com> wrote on Sat., Aug. 24, 2019, 11:08:

    Hi,

    The main puzzle here is how to deliver the priority change of a
    row during a given window. My best shot would be to have a side
    input (PCollectionView) containing the priority changes; then, in
    the slow-worker Beam transform, extract this side input and update
    the corresponding row with the new priority.

    Again, it's just an idea; a rough sketch is below.
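
    A sketch of that idea - Row, its id()/withPriority() accessors,
    and the update stream are placeholders:

    import java.util.Map;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.View;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.PCollectionView;

    // updates: rowId -> new priority, exposed as a map side input.
    static PCollection<Row> applyPriorityUpdates(
        PCollection<Row> rows, PCollection<KV<String, Integer>> updates) {
      PCollectionView<Map<String, Integer>> latest = updates.apply(View.asMap());
      return rows.apply(
          ParDo.of(new DoFn<Row, Row>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
              // Look up the row in the side input and, if present,
              // replace its priority with the latest value.
              Integer p = c.sideInput(latest).get(c.element().id());
              c.output(p == null ? c.element() : c.element().withPriority(p));
            }
          }).withSideInputs(latest));
    }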

    JC

    Chad Dombrova <chad...@gmail.com> wrote on Sat., Aug. 24, 2019, 03:32:

        Hi all,
        Our team is brainstorming how to solve a particular type of
        problem with Beam, and it's a bit beyond our experience level,
        so I thought I'd turn to the experts for some advice.

        Here are the pieces of our puzzle:

          * a data source with the following properties:
              o rows represent work to do
              o each row has an integer priority
              o rows can be added or deleted
              o the priority of a row can be changed
              o <10k rows
          * a slow worker Beam transform (Map/ParDo) that consumes a
            row at a time


        We want a streaming pipeline that delivers rows from our data
        source to the worker transform, re-sorting the remaining rows
        by priority each time a row is delivered.  The goal is that
        last-second changes in priority can affect the order of the
        slowly yielding read.  Throughput is not a major concern, since
        the worker is the bottleneck.

        I have a few questions:

          * is this the sort of problem that BeamSQL can solve? I'm not
            sure how sorting and re-sorting are handled there in a
            streaming context...
          * I'm unclear on how back-pressure in Flink affects
            streaming reads. It's my hope that data/messages are left
            in the data source until back-pressure subsides, rather
            than read eagerly into memory.  Can someone clarify this
            for me?
          * is there a combination of windowing and triggering that
            can solve this continual re-sorting plus slow-yielding
            problem?  It's not unreasonable to keep all of our rows in
            memory on Flink, as long as we're snapshotting state.


        Any advice on how an expert Beamer would solve this is greatly
        appreciated!

        Thanks,
        chad
