Re: Heterogeneous or Dynamic Stream Processing

Rob Shepherd Tue, 07 Jul 2020 09:13:12 -0700

Very helpful, thank you Arvid.

I've been reading up but I'm not sure I grasp all of that just yet.  Please
may I ask for clarification?


1. Am I right in summarising that I may build a list of functions from an
SQL call, which can then be looped over?
This looping sounds appealing, and you are right that "1 or 100" is a big
bonus.

2. "during the start of the application and restart to reflect changes"
By "during the start", do you mean when the job first boots, or immediately
upon ingress of a data event from the queue?
By "restart", do you mean an API call to abort the execution of a piece of
data and retry it with more up-to-date context?
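
To make question 1 concrete, here is a minimal sketch of how I currently
picture "a list of functions that is looped over". It is plain Python, not
Flink code, and every name in it (validate, enrich, REGISTRY,
load_stage_names, the client ids) is hypothetical. The lookup function stands
in for the SQL/config call made once when the job starts, and each stage
returns an error value that is checked for non-null, along the lines of the
error column you describe.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Each stage takes a record and returns (record, error); error is None on success.
Stage = Callable[[dict], Tuple[dict, Optional[str]]]


def validate(record: dict) -> Tuple[dict, Optional[str]]:
    # Hypothetical validation rule: every record needs an "id" field.
    if "id" not in record:
        return record, "validate: missing id"
    return record, None


def enrich(record: dict) -> Tuple[dict, Optional[str]]:
    # Hypothetical enrichment step; a real one might call a database or an API.
    return {**record, "enriched": True}, None


# Hypothetical registry mapping configured names to functions.
REGISTRY: Dict[str, Stage] = {"validate": validate, "enrich": enrich}


def load_stage_names(client: str) -> List[str]:
    # Stand-in for the SQL/config lookup performed once at job start.
    config = {"client_a": ["validate", "enrich"], "client_b": ["enrich"]}
    return config[client]


def build_pipeline(client: str) -> List[Stage]:
    # 1 stage or 100: the pipeline is just a list built in a loop.
    return [REGISTRY[name] for name in load_stage_names(client)]


def run_pipeline(stages: List[Stage], record: dict) -> Tuple[dict, Optional[str]]:
    # Run the stages in order and stop at the first non-null error.
    for stage in stages:
        record, error = stage(record)
        if error is not None:
            return record, error
    return record, None
```

If that is roughly the shape you mean, then changing the configuration would
only take effect when build_pipeline runs again, which is what I understand
"restart to reflect changes" to imply.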


Trying to be a fast learner, and very grateful for the pointers.

With thanks and best regards

Rob




Rob Shepherd BEng PhD



On Tue, 7 Jul 2020 at 15:33, Arvid Heise <ar...@ververica.com> wrote:

> Hi Rob,
>
> In the past I used a mixture of configuration and template queries to
> achieve a similar goal (I had only up to 150 of these jobs per
> application). My approach was not completely dynamic as you have described
> but rather to compose a big query from a configuration during the start of
> the application and restart to reflect changes.
>
> For the simple extractor/mapper, I'd use Table API and plug in SQL
> statements [1] that could be easily given by experienced
> end-users/analysts. Abort logic should be added programmatically to each of
> the extractor/mapper through Table API (for example, extractor can output
> an error column that also gives an explanation and this column is then
> checked for non-null). The big advantage of using Table API over a simple
> SQL query is that you can add structural variance: your application may use
> 1 extractor or 100; it's just a matter of a loop.
>
> Note that async IO is currently not available in Table API, but you can
> easily switch back and forth between Table API and DataStream. I'd
> definitely suggest using async IO for your described use cases.
>
> So please consider also using that less dynamic approach; you'd get a lot
> for free: SQL support with proper query validation and meaningful error
> messages. And it's also much easier to test/debug.
>
>
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/table/common.html#sql
>
> On Tue, Jul 7, 2020 at 4:01 PM Rob Shepherd <rgsheph...@gmail.com> wrote:
>
>> Hi All,
>>
>> It'd be great to consider stream processing as a platform for our
>> upcoming projects. Flink seems to be the closest match.
>>
>> However, we have numerous stream processing workloads and would want to be
>> able to scale up to 1000s of different streams, each quite similar in
>> structure/sequence but with the functional logic being very different in
>> each.
>>
>> For example, there is always a "validate" stage - but what that means is
>> dependent on the client/data/context etc. and would typically map to a few
>> lines of script to perform.
>>
>> In essence, our sequences can often be deconstructed down to 8-12 python
>> snippets and the serverless/functional paradigm seems to fit well.
>>
>> Whilst we can deploy our functions readily to a faas/k8s or something
>> (which seems to fit the bill with remote functions) I don't yet see how to
>> quickly draw these together in a dynamic stream.
>>
>> My initial thoughts would be to create a very general purpose stream job
>> that then works through the context of mapping functions to flink tasks
>> based on the client dataset.
>>
>> E.g. some pseudo code:
>>
>> ingress()
>> extract()
>> getDynamicStreamFunctionDefs()
>> getFunction1()
>> runFunction1()
>> abortOnError()
>> getFunction2()
>> runFunction2()
>> abortOnError()
>> ...
>> getFunction10()
>> runFunction10()
>> sinkData()
>>
>> Most functions are not however simple lexical operations, or
>> extractors/mappers - but on the whole require a few database/API calls to
>> retrieve things like previous data, configurations etc.
>>
>> They are not necessarily long running but certainly Async is a
>> consideration.
>>
>> I think every stage will be UDFs (and then Meta-UDFs at that)
>>
>> As a result, I'm not sure if we can get this to fit without a brittle set
>> of workarounds that would ruin any benefit of running through Flink etc.,
>> but it would be great to hear opinions of others who might have tackled
>> this kind of dynamic tasking.
>>
>>
>> I'm happy to explain this better if it isn't clear.
>>
>> With best regards
>>
>> Rob
>>
>>
>>
>>
>> Rob Shepherd BEng PhD
>>
>>
>
> --
>
> Arvid Heise | Senior Java Developer
