Once you choose to start processing a row, can a different row preempt that work, or once you start processing a row will you always finish it?
So far, I'm agreeing with what Jan has said. I believe there is a way to get what you want working, but with a lot of unneeded complexity, since not much in your problem description lines up with how a data parallel processing system would be beneficial.

On Mon, Aug 26, 2019 at 1:33 AM Jan Lukavský <je...@seznam.cz> wrote:

> Hi,
>
> I have seen cases like that, and my general conclusion is that Beam (or Flink for that matter) is not the best match for solving this. Let me explain:
>
> a) Beam (and Flink) tries hard to deliver exactly-once semantics for the data processing, and this inevitably introduces some latency
>
> b) each pipeline is organized so that it is essentially a bunch of operators (workers) that are orchestrated in a way to deliver a defined computation, but are checkpointed independently
>
> c) the checkpoint barrier starts from the source and then propagates to the sinks
>
> Property c) is what stands in your way the most - to efficiently implement the pattern you described, I believe it is essential to be able to invert the flow of the checkpoint barrier so that it goes from sinks to sources (i.e. when a sink - your slow worker - finishes work, it can notify the parent operator - the source in your case - to perform a checkpoint; this checkpoint can be a commit to Kafka). This subtle difference is the key to the pipeline being able to buffer any realtime updates to the priority of any row (by caching all pending work in memory) and deliver rows to the worker one by one, ordered by priority at the time the worker queries for more work. Moreover, it enables the system to "reject" rows with too low a priority in case they would overflow the buffer (which will not happen in your case if you have <10k rows, I'd say).
>
> So, I think that a simple KafkaConsumer would be more appropriate for what you described. It would be interesting to discuss what would be needed on the Beam side to support cases like that - essentially a microservice - a different runner ([1]) than Flink is a first guess, but I have a feeling that there would be some changes needed in the model as well.
>
> Jan
>
> [1] https://issues.apache.org/jira/browse/BEAM-2466
>
> On 8/24/19 11:10 AM, Juan Carlos Garcia wrote:
>
> Sorry, I hit send too soon. After updating the priority I would execute a group-by DoFn and then use an external sort (by this priority key) to guarantee that in the next DoFn you have a sorted Iterable.
>
> JC
>
> Juan Carlos Garcia <jcgarc...@gmail.com> schrieb am Sa., 24. Aug. 2019, 11:08:
>
>> Hi,
>>
>> The main puzzle here is how to deliver the priority change of a row during a given window. My best shot would be to have a side input (PCollectionView) containing the changes of priority; then, in the slow worker Beam transform, extract this side input and update the corresponding row with the new priority.
>>
>> Again, it's just an idea.
>>
>> JC
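A minimal sketch of the side-input idea above, using Beam's Java SDK. It assumes the rows arrive keyed by row id, that a separate priorityUpdates PCollection carries the latest (rowId, priority) pairs, and that KV<String, Integer> stands in for the real row type; windowing and triggering of the side input are left out.

import java.util.Map;

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;

public class ApplyPriorityUpdates {

  // rows: work items keyed by row id (value = original priority here, as a stand-in).
  // priorityUpdates: the latest priority per row id, delivered as a side input.
  static PCollection<KV<String, Integer>> applyLatestPriorities(
      PCollection<KV<String, Integer>> rows,
      PCollection<KV<String, Integer>> priorityUpdates) {

    // Materialize the priority changes as a map side input.
    final PCollectionView<Map<String, Integer>> priorityView =
        priorityUpdates.apply(View.asMap());

    return rows.apply(
        ParDo.of(
                new DoFn<KV<String, Integer>, KV<String, Integer>>() {
                  @ProcessElement
                  public void processElement(ProcessContext c) {
                    Map<String, Integer> updates = c.sideInput(priorityView);
                    KV<String, Integer> row = c.element();
                    // Override the row's priority if a newer value has arrived;
                    // a downstream group-by plus external sort can then order by it.
                    Integer priority = updates.getOrDefault(row.getKey(), row.getValue());
                    c.output(KV.of(row.getKey(), priority));
                  }
                })
            .withSideInputs(priorityView));
  }
}

In a streaming pipeline the priorityUpdates collection would also need suitable windowing and a trigger so the map view actually refreshes as priorities change; that detail is elided above.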
>>
>> Chad Dombrova <chad...@gmail.com> schrieb am Sa., 24. Aug. 2019, 03:32:
>>
>>> Hi all,
>>> Our team is brainstorming how to solve a particular type of problem with Beam, and it's a bit beyond our experience level, so I thought I'd turn to the experts for some advice.
>>>
>>> Here are the pieces of our puzzle:
>>>
>>>    - a data source with the following properties:
>>>       - rows represent work to do
>>>       - each row has an integer priority
>>>       - rows can be added or deleted
>>>       - the priority of a row can be changed
>>>       - <10k rows
>>>    - a slow worker Beam transform (Map/ParDo) that consumes a row at a time
>>>
>>> We want a streaming pipeline that delivers rows from our data store to the worker transform, resorting the source based on priority each time a new row is delivered. The goal is that last-second changes in priority can affect the order of the slowly yielding read. Throughput is not a major concern since the worker is the bottleneck.
>>>
>>> I have a few questions:
>>>
>>>    - is this the sort of problem that BeamSQL can solve? I'm not sure how sorting and resorting are handled there in a streaming context...
>>>    - I'm unclear on how back-pressure in Flink affects streaming reads. It's my hope that data/messages are left in the data source until back-pressure subsides, rather than read eagerly into memory. Can someone clarify this for me?
>>>    - is there a combination of windowing and triggering that can solve this continual resorting plus slow yielding problem? It's not unreasonable to keep all of our rows in memory on Flink, as long as we're snapshotting state.
>>>
>>> Any advice on how an expert Beamer would solve this is greatly appreciated!
>>>
>>> Thanks,
>>> chad
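For completeness, a rough sketch of the "simple KafkaConsumer" approach Jan suggests, assuming a hypothetical "work-rows" topic keyed by row id with values encoded as "priority,payload" and a placeholder processRow doing the slow work: a single-threaded worker buffers pending rows in memory, always picks the highest-priority one, processes it, and only then commits. The offset bookkeeping is deliberately simplified - committing after each row acknowledges everything polled so far, so rows still sitting in the buffer would need to be recoverable some other way (e.g. by re-reading a compacted topic on restart) to get stronger guarantees.

import java.time.Duration;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class PriorityWorker {

  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("group.id", "priority-worker");
    props.put("enable.auto.commit", "false"); // commit only after the work is done
    props.put("key.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");

    // Pending work keyed by row id; a later record for the same id replaces the
    // earlier one, which is how priority changes (and tombstone deletes) take effect.
    Map<String, String> pending = new HashMap<>();

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
      consumer.subscribe(Collections.singletonList("work-rows"));
      while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<String, String> record : records) {
          if (record.value() == null) {
            pending.remove(record.key()); // tombstone = row deleted
          } else {
            pending.put(record.key(), record.value());
          }
        }
        // Pick the currently highest-priority pending row and do the slow work on it.
        pending.entrySet().stream()
            .max(Comparator.comparingInt(
                (Map.Entry<String, String> e) -> parsePriority(e.getValue())))
            .ifPresent(entry -> {
              processRow(entry.getKey(), entry.getValue()); // the slow, single-row worker
              pending.remove(entry.getKey());
            });
        consumer.commitSync(); // the "checkpoint", taken only after the row is done
      }
    }
  }

  static int parsePriority(String value) {
    return Integer.parseInt(value.split(",", 2)[0]);
  }

  static void processRow(String rowId, String payload) {
    // placeholder for the real work
  }
}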