Once you choose to start processing a row, can a different row preempt that work, or once you start processing a row will you always finish it?
So far, I'm agreeing with what Jan has said. I believe there is a way to get what you want working, but with a lot of unneeded complexity, since not much in your problem description lines up with how a data parallel processing system would be beneficial.

On Mon, Aug 26, 2019 at 1:33 AM Jan Lukavský <je...@seznam.cz> wrote:

> Hi,
>
> I have seen cases like that, and my general conclusion is that Beam (or Flink for that matter) is not the best match for solving this. Let me explain:
>
> a) Beam (and Flink) tries hard to deliver exactly-once semantics for the data processing, and this inevitably introduces some latency
>
> b) each pipeline is organized so that it is essentially a bunch of operators (workers) that are orchestrated in a way to deliver a defined computation, but are checkpointed independently
>
> c) the checkpoint barrier starts from the source and then propagates to the sinks
>
> Property c) is what stands in your way the most - to efficiently implement the pattern you described, I believe it is essential to be able to invert the flow of the checkpoint barrier so that it goes from sinks to sources (i.e. when a sink - your slow worker - finishes work, it can notify the parent operator - the source in your case - to perform a checkpoint; this checkpoint can be a commit to Kafka). This subtle difference is the key to the pipeline being able to buffer any realtime updates to the priority of any row (by caching all pending work in memory) and deliver rows to the worker one by one, ordered by priority at the time the worker queries for more work. Moreover, it enables the system to "reject" rows with too low a priority in case they would overflow the buffer (which will not happen in your case if you have <10k rows, I'd say).
>
> So, I think that a simple KafkaConsumer would be more appropriate for what you described. It would be interesting to discuss what would be needed on the Beam side to support cases like that - essentially a microservice - a different runner ([1]) than Flink is a first guess, but I have a feeling that there would be some changes needed in the model as well.
>
> Jan
>
> [1] https://issues.apache.org/jira/browse/BEAM-2466
>
> On 8/24/19 11:10 AM, Juan Carlos Garcia wrote:
>
> Sorry, I hit send too soon. After updating the priority I would execute a group-by DoFn and then use an external sort (by this priority key) to guarantee that in the next DoFn you have a sorted Iterable.
>
> JC
>
> Juan Carlos Garcia <jcgarc...@gmail.com> schrieb am Sa., 24. Aug. 2019, 11:08:
>
>> Hi,
>>
>> The main puzzle here is how to deliver the priority change of a row during a given window. My best shot would be to have a side input (PCollectionView) containing the changes of priority; then, in the slow worker Beam transform, extract this side input and update the corresponding row with the new priority.
>>
>> Again, it's just an idea.
>>
>> JC
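A minimal sketch of the side-input idea above, using Beam's Java SDK. It assumes the rows arrive keyed by row id, that a separate priorityUpdates PCollection carries the latest (rowId, priority) pairs, and that KV<String, Integer> stands in for the real row type; windowing and triggering of the side input are left out.

import java.util.Map;

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;

public class ApplyPriorityUpdates {

  // rows: work items keyed by row id (value = original priority here, as a stand-in).
  // priorityUpdates: the latest priority per row id, delivered as a side input.
  static PCollection<KV<String, Integer>> applyLatestPriorities(
      PCollection<KV<String, Integer>> rows,
      PCollection<KV<String, Integer>> priorityUpdates) {

    // Materialize the priority changes as a map side input.
    final PCollectionView<Map<String, Integer>> priorityView =
        priorityUpdates.apply(View.asMap());

    return rows.apply(
        ParDo.of(
                new DoFn<KV<String, Integer>, KV<String, Integer>>() {
                  @ProcessElement
                  public void processElement(ProcessContext c) {
                    Map<String, Integer> updates = c.sideInput(priorityView);
                    KV<String, Integer> row = c.element();
                    // Override the row's priority if a newer value has arrived;
                    // a downstream group-by plus external sort can then order by it.
                    Integer priority = updates.getOrDefault(row.getKey(), row.getValue());
                    c.output(KV.of(row.getKey(), priority));
                  }
                })
            .withSideInputs(priorityView));
  }
}

In a streaming pipeline the priorityUpdates collection would also need suitable windowing and a trigger so the map view actually refreshes as priorities change; that detail is elided above.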
>>
>> Chad Dombrova <chad...@gmail.com> schrieb am Sa., 24. Aug. 2019, 03:32:
>>
>>> Hi all,
>>> Our team is brainstorming how to solve a particular type of problem with Beam, and it's a bit beyond our experience level, so I thought I'd turn to the experts for some advice.
>>>
>>> Here are the pieces of our puzzle:
>>>
>>>    - a data source with the following properties:
>>>       - rows represent work to do
>>>       - each row has an integer priority
>>>       - rows can be added or deleted
>>>       - the priority of a row can be changed
>>>       - <10k rows
>>>    - a slow worker Beam transform (Map/ParDo) that consumes a row at a time
>>>
>>> We want a streaming pipeline that delivers rows from our data store to the worker transform, resorting the source based on priority each time a new row is delivered. The goal is that last-second changes in priority can affect the order of the slowly yielding read. Throughput is not a major concern since the worker is the bottleneck.
>>>
>>> I have a few questions:
>>>
>>>    - is this the sort of problem that BeamSQL can solve? I'm not sure how sorting and resorting are handled there in a streaming context...
>>>    - I'm unclear on how back-pressure in Flink affects streaming reads. It's my hope that data/messages are left in the data source until back-pressure subsides, rather than read eagerly into memory. Can someone clarify this for me?
>>>    - is there a combination of windowing and triggering that can solve this continual resorting plus slow yielding problem? It's not unreasonable to keep all of our rows in memory on Flink, as long as we're snapshotting state.
>>>
>>> Any advice on how an expert Beamer would solve this is greatly appreciated!
>>>
>>> Thanks,
>>> chad
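For completeness, a rough sketch of the "simple KafkaConsumer" approach Jan suggests, assuming a hypothetical "work-rows" topic keyed by row id with values encoded as "priority,payload" and a placeholder processRow doing the slow work: a single-threaded worker buffers pending rows in memory, always picks the highest-priority one, processes it, and only then commits. The offset bookkeeping is deliberately simplified - committing after each row acknowledges everything polled so far, so rows still sitting in the buffer would need to be recoverable some other way (e.g. by re-reading a compacted topic on restart) to get stronger guarantees.

import java.time.Duration;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class PriorityWorker {

  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("group.id", "priority-worker");
    props.put("enable.auto.commit", "false"); // commit only after the work is done
    props.put("key.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");

    // Pending work keyed by row id; a later record for the same id replaces the
    // earlier one, which is how priority changes (and tombstone deletes) take effect.
    Map<String, String> pending = new HashMap<>();

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
      consumer.subscribe(Collections.singletonList("work-rows"));
      while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<String, String> record : records) {
          if (record.value() == null) {
            pending.remove(record.key()); // tombstone = row deleted
          } else {
            pending.put(record.key(), record.value());
          }
        }
        // Pick the currently highest-priority pending row and do the slow work on it.
        pending.entrySet().stream()
            .max(Comparator.comparingInt(
                (Map.Entry<String, String> e) -> parsePriority(e.getValue())))
            .ifPresent(entry -> {
              processRow(entry.getKey(), entry.getValue()); // the slow, single-row worker
              pending.remove(entry.getKey());
            });
        consumer.commitSync(); // the "checkpoint", taken only after the row is done
      }
    }
  }

  static int parsePriority(String value) {
    return Integer.parseInt(value.split(",", 2)[0]);
  }

  static void processRow(String rowId, String payload) {
    // placeholder for the real work
  }
}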