Thanks Garvit for your suggestion!

From: Garvit Sharma <garvit...@gmail.com>
Date: Tuesday, June 5, 2018 at 8:44 AM
To: "Sandybayev, Turar (CAI - Atlanta)" <turar.sandyba...@coxautoinc.com>
Cc: "user@flink.apache.org" <user@flink.apache.org>
Subject: Re: Implementing a “join” between a DataStream and a “set of rules”

Hi,

For the above use case, you should do the following :

1. Convert your DataStream into KeyedDataStream by defining a key which would 
be used to get validated against your rules.
2. Same as 1 for rules stream.
3. Join the two keyedStreams using Flink's connect operator.
4. Store the rules into Flink's internal state i,e. Flink's managed keyed state.
5. Validate the data coming in the dataStream against the managed keyed state.

Refer to [1] [2] for more details.

[1] 
https://ci.apache.org/projects/flink/flink-docs-release-1.5/dev/stream/operators/
[2] 
https://ci.apache.org/projects/flink/flink-docs-release-1.5/dev/stream/state/state.html



On Tue, Jun 5, 2018 at 5:52 PM, Sandybayev, Turar (CAI - Atlanta) 
<turar.sandyba...@coxautoinc.com<mailto:turar.sandyba...@coxautoinc.com>> wrote:
Hi,

What is the best practice recommendation for the following use case? We need to 
match a stream against a set of “rules”, which are essentially a Flink DataSet 
concept. Updates to this “rules set" are possible but not frequent. Each stream 
event must be checked against all the records in “rules set”, and each match 
produces one or more events into a sink. Number of records in a rule set are in 
the 6 digit range.

Currently we're simply loading rules into a local List of rules and using 
flatMap over an incoming DataStream. Inside flatMap, we're just iterating over 
a list comparing each event to each rule.

To speed up the iteration, we can also split the list into several batches, 
essentially creating a list of lists, and creating a separate thread to iterate 
over each sub-list (using Futures in either Java or Scala).

Questions:
1.            Is there a better way to do this kind of a join?
2.            If not, is it safe to add additional parallelism by creating new 
threads inside each flatMap operation, on top of what Flink is already doing?

Thanks in advance!
Turar




--

Garvit Sharma
github.com/garvitlnmiit/<http://github.com/garvitlnmiit/>

No Body is a Scholar by birth, its only hard work and strong determination that 
makes him master.

Reply via email to