Hi,

I have seen cases like that, and my general conclusion is that Beam (or Flink for that matter) is not the best match for solving this. Let me explain:

 a) Beam (and Flink) tries hard to deliver exactly-once semantics for the data processing, and this inevitably introduces some latency

 b) each pipeline is essentially a set of operators (workers) orchestrated to deliver a defined computation, but checkpointed independently

 c) the checkpoint barrier starts from the sources and then propagates to the sinks

The property c) is what stands in your way the most. To implement the pattern you described efficiently, I believe it is essential to be able to invert the flow of the checkpoint barrier so that it goes from sinks to sources - i.e., when a sink (your slow worker) finishes a piece of work, it can notify the parent operator (the source, in your case) to perform a checkpoint, and this checkpoint can be a commit to Kafka. This subtle difference is the key that lets the pipeline buffer any realtime updates to the priority of any row (by caching all pending work in memory) and deliver rows to the worker one by one, ordered by priority as of the time the worker asks for more work. Moreover, it enables the system to "reject" rows whose priority is too low when the buffer would otherwise overflow (which will not happen in your case if you have <10k rows, I'd say).

So, I think that a simple KafkaConsumer would be more appropriate for what you described; a minimal sketch follows below. It would be interesting to discuss what would be needed on the Beam side to support cases like that (essentially a microservice). A different runner ([1]) than Flink is my first guess, but I have a feeling that some changes would be needed in the model as well.
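
To make this concrete, here is a minimal sketch of the consumer loop I have in mind - the topic name, the Row record, and process() are placeholders, and the offset handling is deliberately simplified:

import java.time.Duration;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class PriorityWorker {
  // Hypothetical row type: an id plus its current priority.
  record Row(String id, int priority) {}

  static void process(Row row) { /* the slow worker */ }

  public static void main(String[] args) {
    Properties props = new Properties(); // bootstrap.servers, group.id, deserializers, ...
    try (KafkaConsumer<String, Row> consumer = new KafkaConsumer<>(props)) {
      consumer.subscribe(Collections.singletonList("work"));
      // All pending work is cached in memory; a later record with the
      // same key replaces the row, which is how a priority update (or
      // a delete, via a null value) takes effect.
      Map<String, Row> pending = new HashMap<>();
      while (true) {
        for (ConsumerRecord<String, Row> rec : consumer.poll(Duration.ofMillis(100))) {
          if (rec.value() == null) {
            pending.remove(rec.key());
          } else {
            pending.put(rec.key(), rec.value());
          }
        }
        // Hand the highest-priority row to the worker, then commit -
        // the "checkpoint" happens only after the work is done. Note
        // that commitSync() acknowledges everything polled so far,
        // including rows still pending in memory, so a real version
        // needs idempotent processing or per-offset commits.
        pending.values().stream()
            .max(Comparator.comparingInt(Row::priority))
            .ifPresent(row -> {
              process(row);
              pending.remove(row.id());
              consumer.commitSync();
            });
      }
    }
  }
}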

Jan

[1] https://issues.apache.org/jira/browse/BEAM-2466

On 8/24/19 11:10 AM, Juan Carlos Garcia wrote:
Sorry, I hit send too soon. After updating the priority, I would apply a GroupByKey and then use an external sort (by this priority key) to guarantee that the next DoFn receives a sorted Iterable; see the sketch below.
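
A rough sketch of that, using Beam's sorter extension (the beam-sdks-java-extensions-sorter module); the Row type and the keying step are placeholders:

import org.apache.beam.sdk.extensions.sorter.BufferedExternalSorter;
import org.apache.beam.sdk.extensions.sorter.SortValues;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

// Input: rows keyed for grouping, with the (possibly updated)
// priority as the secondary key.
static PCollection<KV<String, Iterable<KV<Integer, Row>>>> sortByPriority(
    PCollection<KV<String, KV<Integer, Row>>> rows) {
  return rows
      .apply(GroupByKey.<String, KV<Integer, Row>>create())
      // Sorts each group by the encoded secondary key (the priority),
      // spilling to disk when it does not fit in memory, so the next
      // DoFn receives the values as a sorted Iterable.
      .apply(SortValues.create(BufferedExternalSorter.options()));
}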


JC

Juan Carlos Garcia <jcgarc...@gmail.com> wrote on Sat., Aug. 24, 2019, 11:08:

    Hi,

    The main puzzle here is how to deliver the priority change of a
    row during a given window. My best shot would be to have a side
    input (PCollectionView) containing the priority changes; then, in
    the slow-worker Beam transform, extract this side input and update
    the corresponding row with the new priority.

    Again, it's just an idea; a rough sketch is below.
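
    A sketch of that idea - Row, its id()/withPriority() accessors,
    and the update stream are placeholders:

    import java.util.Map;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.View;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.PCollectionView;

    // updates: rowId -> new priority, exposed as a map side input.
    static PCollection<Row> applyPriorityUpdates(
        PCollection<Row> rows, PCollection<KV<String, Integer>> updates) {
      PCollectionView<Map<String, Integer>> latest = updates.apply(View.asMap());
      return rows.apply(
          ParDo.of(new DoFn<Row, Row>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
              // Look up the row in the side input and, if present,
              // replace its priority with the latest value.
              Integer p = c.sideInput(latest).get(c.element().id());
              c.output(p == null ? c.element() : c.element().withPriority(p));
            }
          }).withSideInputs(latest));
    }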

    JC

    Chad Dombrova <chad...@gmail.com> wrote on Sat., Aug. 24, 2019, 03:32:

        Hi all,
        Our team is brainstorming how to solve a particular type of
        problem with Beam, and it's a bit beyond our experience level,
        so I thought I'd turn to the experts for some advice.

        Here are the pieces of our puzzle:

          * a data source with the following properties:
              o rows represent work to do
              o each row has an integer priority
              o rows can be added or deleted
              o the priority of a row can be changed
              o <10k rows
          * a slow worker Beam transform (Map/ParDo) that consumes a
            row at a time


        We want a streaming pipeline that delivers rows from our data
        source to the worker transform, re-sorting the remaining rows
        by priority each time a row is delivered.  The goal is that
        last-second changes in priority can affect the order of the
        slowly yielding read.  Throughput is not a major concern, since
        the worker is the bottleneck.

        I have a few questions:

          * is this the sort of problem that BeamSQL can solve? I'm not
            sure how sorting and re-sorting are handled there in a
            streaming context...
          * I'm unclear on how back-pressure in Flink affects
            streaming reads. It's my hope that data/messages are left
            in the data source until back-pressure subsides, rather
            than read eagerly into memory.  Can someone clarify this
            for me?
          * is there a combination of windowing and triggering that
            can solve this continual re-sorting plus slow-yielding
            problem?  It's not unreasonable to keep all of our rows in
            memory on Flink, as long as we're snapshotting state.


        Any advice on how an expert Beamer would solve this is greatly
        appreciated!

        Thanks,
        chad
