Hi all,

Our team is brainstorming how to solve a particular type of problem with Beam, and it's a bit beyond our experience level, so I thought I'd turn to the experts for some advice.
Here are the pieces of our puzzle:

- a data source with the following properties:
  - rows represent work to do
  - each row has an integer priority
  - rows can be added or deleted
  - the priority of a row can be changed
  - <10k rows
- a slow worker Beam transform (Map/ParDo) that consumes one row at a time

We want a streaming pipeline that delivers rows from our data store to the worker transform, re-sorting the remaining rows by priority each time a row is delivered. The goal is that last-second changes in priority can affect the order of the slowly yielding read. Throughput is not a major concern, since the worker is the bottleneck.

I have a few questions:

- Is this the sort of problem that BeamSQL can solve? I'm not sure how sorting and re-sorting are handled there in a streaming context...
- I'm unclear on how back-pressure in Flink affects streaming reads. My hope is that data/messages are left in the data source until back-pressure subsides, rather than being read eagerly into memory. Can someone clarify this for me?
- Is there a combination of windowing and triggering that can solve this continual re-sorting plus slow-yielding problem? It's not unreasonable to keep all of our rows in memory on Flink, as long as we're snapshotting state. (I've put a rough sketch of what we were imagining below my signature.)

Any advice on how an expert Beamer would solve this is greatly appreciated!

Thanks,
chad
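
P.S. To make the question concrete, here is a rough sketch (Beam Python SDK) of the direction we were imagining: route every row update onto a single key, buffer the updates in keyed state, and on a processing-time timer emit whichever buffered row currently has the highest priority. All of the names here are made up by us, the "negative priority means delete" convention is arbitrary, and pacing emission with a fixed timer interval is only a stand-in for the real back-pressure behaviour we're asking about, so please treat it as an illustration rather than a working design.

import apache_beam as beam
from apache_beam.coders import StrUtf8Coder, TupleCoder, VarIntCoder
from apache_beam.transforms.timeutil import TimeDomain
from apache_beam.transforms.userstate import BagStateSpec, TimerSpec, on_timer
from apache_beam.utils.timestamp import Duration, Timestamp


class EmitHighestPriority(beam.DoFn):
    """Buffers (row_id, priority) updates and, each time the processing-time
    timer fires, emits the single row with the highest current priority."""

    # Bag of (row_id, priority) pairs; by our (arbitrary) convention a
    # negative priority marks a deleted row.
    PENDING = BagStateSpec('pending', TupleCoder((StrUtf8Coder(), VarIntCoder())))
    EMIT = TimerSpec('emit', TimeDomain.REAL_TIME)

    # Delay between emissions; a crude stand-in for real back-pressure.
    PACE_SECONDS = 5

    def process(self,
                element,
                pending=beam.DoFn.StateParam(PENDING),
                emit_timer=beam.DoFn.TimerParam(EMIT)):
        _, (row_id, priority) = element  # everything arrives under one dummy key
        pending.add((row_id, priority))
        # (Re)arm the emission timer. With a steady trickle of updates this
        # keeps getting pushed out, which is one of the details we'd need to fix.
        emit_timer.set(Timestamp.now() + Duration(seconds=self.PACE_SECONDS))

    @on_timer(EMIT)
    def emit(self,
             pending=beam.DoFn.StateParam(PENDING),
             emit_timer=beam.DoFn.TimerParam(EMIT)):
        # Collapse updates: the last priority seen for a row wins.
        latest = {}
        for row_id, priority in pending.read():
            latest[row_id] = priority
        pending.clear()

        live = {r: p for r, p in latest.items() if p >= 0}
        if not live:
            return

        # Hand the highest-priority row to the slow worker downstream.
        best = max(live, key=live.get)
        yield (best, live.pop(best))

        # Put the remaining rows back into state and schedule the next emission.
        for row_id, priority in live.items():
            pending.add((row_id, priority))
        emit_timer.set(Timestamp.now() + Duration(seconds=self.PACE_SECONDS))


# Intended usage (ReadFromOurSource and SlowWorkerFn are hypothetical):
#
#   (p
#    | 'ReadUpdates' >> ReadFromOurSource()
#    | 'SingleKey' >> beam.Map(lambda row: ('all', row))
#    | 'PriorityBuffer' >> beam.ParDo(EmitHighestPriority())
#    | 'SlowWorker' >> beam.ParDo(SlowWorkerFn()))

The single dummy key is what lets all <10k rows share one state cell (and get checkpointed by Flink), at the obvious cost of serializing everything through one key; that trade-off seemed acceptable given that the worker, not throughput, is our bottleneck.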