Hi,
I have seen cases like that, and my general conclusion is that Beam (or
Flink for that matter) is not the best match for solving this. Let me
explain:
a) Beam (and Flink) try hard to deliver exactly-once semantics for
the data processing, and this inevitably introduces some latency
b) each pipeline is organized as essentially a bunch of operators
(workers) that are orchestrated to deliver a defined computation, but
are checkpointed independently
c) the checkpoint barrier starts at the sources and then propagates to
the sinks
Property c) is what stands in your way the most - to efficiently
implement the pattern you described, I believe it is essential to be
able to invert the flow of the checkpoint barrier so that it goes from
sinks to sources (i.e. when a sink - your slow worker - finishes a piece
of work, it can notify its parent operator - the source in your case -
to perform a checkpoint. This checkpoint can be a commit to Kafka). This
subtle difference is key to the pipeline being able to buffer any
realtime updates to the priority of any row (by caching all pending work
in memory) and deliver rows to the worker one by one, ordered by
priority as of the moment the worker asks for more work. Moreover, it
enables the system to "reject" rows whose priority is too low if they
would overflow the buffer (which will not happen in your case if you
have <10k rows, I'd say).
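To make this concrete, here is a minimal sketch (plain Java, not Beam or
Flink code) of such a buffer, under the assumption that rows can be
identified by a String id and the priority is a plain int: all pending
work sits in memory, a priority update simply overwrites the stored
value, rows that would overflow the buffer are rejected, and the worker
is handed the currently highest-priority row whenever it asks for more.
With <10k rows a linear scan per request is perfectly fine.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Optional;

    // In-memory buffer of pending work, ordered by priority on demand.
    class PendingWorkBuffer {
      private final int capacity;
      private final Map<String, Integer> priorityById = new HashMap<>();

      PendingWorkBuffer(int capacity) { this.capacity = capacity; }

      /** Adds a row or updates its priority; returns false if the buffer is full. */
      synchronized boolean upsert(String rowId, int priority) {
        if (!priorityById.containsKey(rowId) && priorityById.size() >= capacity) {
          return false;                      // "reject" instead of overflowing
        }
        priorityById.put(rowId, priority);   // the latest priority update wins
        return true;
      }

      /** Called by the slow worker whenever it is ready for the next row. */
      synchronized Optional<String> next() {
        Optional<String> best = priorityById.entrySet().stream()
            .max(Map.Entry.comparingByValue())   // highest priority first
            .map(Map.Entry::getKey);
        best.ifPresent(priorityById::remove);
        return best;
      }
    }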
So, I think that a simple KafkaConsumer would be more appropriate for
what you described. It would be interesting to discuss what would be
needed on the Beam side to support cases like this - essentially a
microservice - a different runner than Flink ([1]) is my first guess,
but I have a feeling that some changes would be needed in the model as
well.
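For illustration, such a consumer could look roughly like the sketch
below. The topic name, consumer properties, record shape and the
process() call are placeholders, and the offset handling is deliberately
simplified - commitSync() after each processed row also commits rows
that are still only buffered, so a real implementation would need
per-row offset tracking or a durable buffer.

    import java.time.Duration;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Optional;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class PriorityWorkConsumer {

      public static void main(String[] args) {
        Properties props = new Properties();
        // bootstrap.servers, group.id, key/value deserializers,
        // enable.auto.commit=false, ... (omitted here)

        try (KafkaConsumer<String, Integer> consumer = new KafkaConsumer<>(props)) {
          consumer.subscribe(Collections.singletonList("work-rows"));

          // row id -> latest known priority; an update simply overwrites
          Map<String, Integer> pending = new HashMap<>();

          while (true) {
            // pull in new rows and priority updates
            for (ConsumerRecord<String, Integer> rec : consumer.poll(Duration.ofMillis(500))) {
              pending.put(rec.key(), rec.value());
            }
            // hand the currently highest-priority row to the slow worker and
            // "checkpoint" (commit) only after that piece of work has finished
            Optional<Map.Entry<String, Integer>> best =
                pending.entrySet().stream().max(Map.Entry.comparingByValue());
            if (best.isPresent()) {
              process(best.get().getKey());   // the slow worker
              pending.remove(best.get().getKey());
              consumer.commitSync();          // simplified, see note above
            }
          }
        }
      }

      private static void process(String rowId) {
        // placeholder for the slow, row-at-a-time work
      }
    }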
Jan
[1] https://issues.apache.org/jira/browse/BEAM-2466
On 8/24/19 11:10 AM, Juan Carlos Garcia wrote:
Sorry, I hit send too soon. Then, after updating the priority, I would
do a group-by and then use an external sort (by this priority key) to
guarantee that the next DoFn receives a sorted Iterable.
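A rough sketch of what I mean, using Beam's sorter extension
(beam-sdks-java-extensions-sorter); the grouping key, the generic row
type and the wiring around it are placeholders:

    import org.apache.beam.sdk.extensions.sorter.BufferedExternalSorter;
    import org.apache.beam.sdk.extensions.sorter.SortValues;
    import org.apache.beam.sdk.transforms.GroupByKey;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;

    public class SortRowsByPriority {

      // Input: rows keyed by some grouping key (e.g. a single constant key
      // per window), with the updated priority as the secondary key.
      public static <RowT> PCollection<KV<String, Iterable<KV<Integer, RowT>>>>
          sortByPriority(PCollection<KV<String, KV<Integer, RowT>>> rowsWithPriority) {
        return rowsWithPriority
            .apply(GroupByKey.create())
            // SortValues orders each grouped Iterable by the encoded bytes of
            // the secondary key, so the priority coder's byte order must match
            // numeric order (e.g. BigEndianIntegerCoder for non-negative values).
            .apply(SortValues.create(BufferedExternalSorter.options()));
      }
    }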
JC
Juan Carlos Garcia <jcgarc...@gmail.com> wrote on Sat., 24 Aug 2019, 11:08:
Hi,
The main puzzle here is how to deliver the priority change of a row
during a given window. My best shot would be to have a side input
(PCollectionView) containing the priority changes, then in the slow
worker Beam transform read this side input and update the corresponding
row with the new priority.
Again, it's just an idea.
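Roughly along these lines; the names and the KV<String, Integer> row
shape are placeholders, and it assumes at most one pending update per
row id per window (otherwise View.asMultimap would be needed):

    import java.util.Map;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.View;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.PCollectionView;

    public class ApplyPriorityUpdates {

      // rows:            row id -> original priority
      // priorityUpdates: row id -> new priority (suitably windowed/triggered)
      public static PCollection<KV<String, Integer>> withLatestPriority(
          PCollection<KV<String, Integer>> rows,
          PCollection<KV<String, Integer>> priorityUpdates) {

        PCollectionView<Map<String, Integer>> latestPriority =
            priorityUpdates.apply(View.asMap());

        return rows.apply(
            ParDo.of(
                    new DoFn<KV<String, Integer>, KV<String, Integer>>() {
                      @ProcessElement
                      public void processElement(ProcessContext c) {
                        Map<String, Integer> updates = c.sideInput(latestPriority);
                        KV<String, Integer> row = c.element();
                        // use the updated priority if present, else keep the old one
                        Integer priority =
                            updates.getOrDefault(row.getKey(), row.getValue());
                        c.output(KV.of(row.getKey(), priority));
                      }
                    })
                .withSideInputs(latestPriority));
      }
    }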
JC
Chad Dombrova <chad...@gmail.com> wrote on Sat., 24 Aug 2019, 03:32:
Hi all,
Our team is brainstorming how to solve a particular type of
problem with Beam, and it's a bit beyond our experience level,
so I thought I'd turn to the experts for some advice.
Here are the pieces of our puzzle:
* a data source with the following properties:
    o rows represent work to do
    o each row has an integer priority
    o rows can be added or deleted
    o the priority of a row can be changed
    o <10k rows
* a slow worker Beam transform (Map/ParDo) that consumes a row at a time
We want a streaming pipeline that delivers rows from our data
store to the worker transform, re-sorting the source based on
priority each time a new row is delivered. The goal is that
last-second changes in priority can affect the order of the
slowly-yielding read. Throughput is not a major concern since
the worker is the bottleneck.
I have a few questions:
* is this the sort of problem that BeamSQL can solve? I'm not
sure how sorting and re-sorting are handled there in a
streaming context...
* I'm unclear on how back-pressure in Flink affects
streaming reads. It's my hope that data/messages are left
in the data source until back-pressure subsides, rather
than read eagerly into memory. Can someone clarify this
for me?
* is there a combination of windowing and triggering that
can solve this continual re-sorting plus slow-yielding
problem? It's not unreasonable to keep all of our rows in
memory on Flink, as long as we're snapshotting state.
Any advice on how an expert Beamer would solve this is greatly
appreciated!
Thanks,
chad