Re: Communicating with my operators

Chesnay Schepler Wed, 15 Jul 2020 02:48:21 -0700

Using an S3 bucket containing the configuration is the way to go.
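
As a rough illustration of that approach, here is a minimal sketch in plain Java with no Flink dependency. It assumes the configuration object is a text file with one bucket name per line; that format, and the ConfigDiff class itself, are hypothetical and only there for illustration. The idea is that each time the configuration is re-read, you diff it against the previous snapshot to work out which buckets were added or removed.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of Flink: compares two configuration snapshots.
final class ConfigDiff {
    final Set<String> added;
    final Set<String> removed;

    private ConfigDiff(Set<String> added, Set<String> removed) {
        this.added = added;
        this.removed = removed;
    }

    // Assumed file format: one bucket name per line, blank lines and
    // lines starting with '#' are ignored.
    static Set<String> parse(String contents) {
        Set<String> buckets = new TreeSet<>();
        for (String line : contents.split("\\r?\\n")) {
            String name = line.trim();
            if (!name.isEmpty() && !name.startsWith("#")) {
                buckets.add(name);
            }
        }
        return buckets;
    }

    // Buckets present only in 'latest' were added; only in 'previous' were removed.
    static ConfigDiff between(Set<String> previous, Set<String> latest) {
        Set<String> added = new TreeSet<>(latest);
        added.removeAll(previous);
        Set<String> removed = new TreeSet<>(previous);
        removed.removeAll(latest);
        return new ConfigDiff(added, removed);
    }

    public static void main(String[] args) {
        Set<String> previous = parse("bucket-a\nbucket-b\n");
        Set<String> latest = parse("# edited by the admin tool\nbucket-b\nbucket-c\n");
        ConfigDiff diff = between(previous, latest);
        System.out.println("added=" + diff.added + " removed=" + diff.removed);
    }
}
```

In a real job the added/removed sets would be turned into events and broadcast to the operators that need them.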

1) web sockets, or more generally all approaches where you connect to the source

The JobGraph won't help you; it doesn't contain the information on where tasks are deployed to at runtime. It is just an abstract representation of your job.

You could theoretically retrieve the actual location through the REST API, and maybe expose the port as a metric.

But then you still have to deal with resolving IPs, internal/external IPs and all that jazz.


2) CoProcessFunction

We still have to get the data in somehow; so you'd need to have some source in any case :)


3) ParameterTool

This is really just a parsing tool, so it won't help for this use-case.

4) State Processor API

A bit too complicated. If restarting jobs is an option, you could just encode the commands into the source, emit them as events of sorts, and have the process function update its state on reception of these events.
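
To make that last idea a bit more concrete, here is a minimal sketch in plain Java with no Flink dependency. BucketCommand and BucketState are hypothetical names, not Flink classes; in a real job the apply logic would sit inside the process function and the set would be kept in Flink-managed state rather than in a plain field.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical control event that the source would emit alongside regular data.
final class BucketCommand {
    enum Kind { ADD, REMOVE }

    final Kind kind;
    final String bucket;

    BucketCommand(Kind kind, String bucket) {
        this.kind = kind;
        this.bucket = bucket;
    }
}

// Stand-in for the state a process function would hold (assumption: a set of bucket names).
final class BucketState {
    private final Set<String> buckets = new TreeSet<>();

    // Applies one command; returns true if the set of buckets actually changed.
    boolean apply(BucketCommand command) {
        switch (command.kind) {
            case ADD:
                return buckets.add(command.bucket);
            case REMOVE:
                return buckets.remove(command.bucket);
            default:
                throw new IllegalArgumentException("Unknown command: " + command.kind);
        }
    }

    // Defensive copy so callers cannot modify the state directly.
    Set<String> snapshot() {
        return new TreeSet<>(buckets);
    }
}

final class CommandDemo {
    public static void main(String[] args) {
        BucketState state = new BucketState();
        state.apply(new BucketCommand(BucketCommand.Kind.ADD, "my-s3-bucket"));
        System.out.println("after add: " + state.snapshot());
        state.apply(new BucketCommand(BucketCommand.Kind.REMOVE, "my-s3-bucket"));
        System.out.println("after remove: " + state.snapshot());
    }
}
```

Because apply is a no-op for a duplicate add or a remove of an unknown bucket, replaying the same commands after a restart should leave the state unchanged.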


On 15/07/2020 10:00, Tom Wells wrote:

Hi Everyone
I'm looking for some advice on designing my operators (which unsurprisingly tend to take the form of SourceFunctions, ProcessFunctions or SinkFunctions) to allow them to be "dynamically configured" while running.
By way of example, I have a SourceFunction which collects the names of various S3 buckets, and then a ProcessFunction which reads and collects their contents. The gotcha is that the list of S3 buckets is not fixed, and can be changed during the lifetime of the job. This add/remove action would be done by some human administrator, let's say using a simple command line tool.
For example - here is an idea of what I want to build to "communicate" with my Flink job:
```
# Add a bucket to the flink job to process
$ ./admin-tool add-bucket --name my-s3-bucket --region eu-west-1 --access-key <blah>...
# Get a list of the s3 buckets we're currently processing, and when they were last accessed
$ ./admin-tool list-buckets
my-s3-bucket | eu-west-1 | 5 seconds ago

# Remove buckets
$ ./admin-tool remove-bucket --name my-s3-bucket
```
Hope that gives you an idea - of course this could apply to any number of different source types, and could even extend to configuration of sinks etc too.
So - how should my command line tool communicate with my operators?

4 alternative approaches I've thought about:
- Have a SourceFunction open a websocket and listen for bucket add/remove commands (written to by the command line tool). I think this would work, but the difficulty is in figuring out where exactly the SourceFunction might be deployed in the Flink cluster to find the websocket listening port. I took a look at the ClusterClient API and it's possibly available by inspecting the JobGraph... I'm just not sure if this is an anti-pattern?
- Use a CoProcessFunction instead, and have it be joined with a DataStream that I can somehow write to directly from the command line tool (maybe using flink-client api - can I write to a DataStream directly??). I don't think this is possible but would feel like a good clean approach?
- Somehow using the ParameterTool. I don't think it supports a dynamic use-case though?
- Writing directly to the saved state of a ProcessFunction to add or remove bucket names. I'm pretty unfamiliar with this approach - but it looks possible according to the docs on the State Processor API - however it seems like I would have to read the savepoint, write the updates, then restore from the savepoint, which may mean suspending and resuming the job while that happens. Not really an issue for me, but it does feel like possibly the wrong approach for my simple requirement.
- Solve it just using datasources - e.g. create a centrally read S3 bucket which holds the latest configuration and is sourced and joined by every operator (probably using Broadcast State). My command line tool would then just have to write to that S3 bucket - no need to communicate directly with the operators.
The last option is fairly obvious and probably my default approach - I'm just wondering whether any of the alternatives above are worth investigating. (Especially considering my endless quest to learn everything about Flink - I don't mind exploring the less obvious pathways).
I would love some guidance or advice on what you feel is the best approach / idiomatic approach for this.
All the best,
Tom
