Implementing Samza table/stream join

Nick Dimiduk Tue, 10 Nov 2015 10:03:02 -0800

Hello,

I'm interested in implementing a table/stream join, very similar to what is
described in the "Table-stream join" section of the Samza key-value state
documentation [0]. Conceptually, this would be an extension of the example
provided in the javadocs for RichFunction#open [1], where I have a dataset
of searchStrings instead of a single one. As per the Samza explanation, I
would like to receive updates to this dataset via an operation log (e.g., a
Kafka topic), so that I can update my local state while the streaming job runs.
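
To make the idea concrete, here is a minimal sketch in plain Java of the local
state I have in mind. It uses no Flink or Samza APIs, and the names
SearchStringTable, applyChange, and match are hypothetical, chosen only for
illustration. I am also assuming that each changelog entry is either an upsert
or a delete of a single search string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical local state: the set of search strings, kept current from a changelog.
class SearchStringTable {
    // TreeSet keeps matches in a deterministic (sorted) order.
    private final TreeSet<String> searchStrings = new TreeSet<>();

    // Applies one changelog entry: present=true upserts the string, present=false deletes it.
    void applyChange(String searchString, boolean present) {
        if (present) {
            searchStrings.add(searchString);
        } else {
            searchStrings.remove(searchString);
        }
    }

    // Returns every search string currently in the table that occurs in the event.
    List<String> match(String event) {
        List<String> hits = new ArrayList<>();
        for (String s : searchStrings) {
            if (event.contains(s)) {
                hits.add(s);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        SearchStringTable table = new SearchStringTable();
        table.applyChange("flink", true);
        table.applyChange("samza", true);
        System.out.println(table.match("apache flink streaming"));
        table.applyChange("flink", false); // a later changelog entry removes the string
        System.out.println(table.match("apache flink streaming"));
    }
}
```

In a real job the applyChange calls would be driven by the changelog stream and
the match calls by the data stream, both against the same operator instance.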


Perhaps you can further advise on parallelization strategy for this
operation. It seems to me that I'd want to partition the searchString
database across multiple parallelization units and broadcast my input
datastream to all those units. The idea is to maximize throughput on the
available hardware, though I would expect a point at which the network
becomes the bottleneck for the broadcast.
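
As a rough illustration of that strategy, here is a plain-Java simulation (no
Flink APIs). The class PartitionedSearch and its methods are hypothetical, and
hash partitioning of the search strings is my assumption. Each search string
lives in exactly one partition, while every event is shown to all partitions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical simulation: search strings are hash-partitioned, events are broadcast.
class PartitionedSearch {
    private final List<TreeSet<String>> partitions = new ArrayList<>();

    PartitionedSearch(int parallelism) {
        for (int i = 0; i < parallelism; i++) {
            partitions.add(new TreeSet<>());
        }
    }

    // Routes a search string to exactly one partition; floorMod avoids negative indices.
    int partitionFor(String searchString) {
        return Math.floorMod(searchString.hashCode(), partitions.size());
    }

    void addSearchString(String searchString) {
        partitions.get(partitionFor(searchString)).add(searchString);
    }

    // Broadcast: every partition sees the event and checks only its own slice of the table.
    List<String> broadcast(String event) {
        TreeSet<String> hits = new TreeSet<>(); // sorted, so the result order is deterministic
        for (TreeSet<String> partition : partitions) {
            for (String s : partition) {
                if (event.contains(s)) {
                    hits.add(s);
                }
            }
        }
        return new ArrayList<>(hits);
    }

    public static void main(String[] args) {
        PartitionedSearch search = new PartitionedSearch(3);
        search.addSearchString("flink");
        search.addSearchString("samza");
        search.addSearchString("kafka");
        System.out.println(search.broadcast("samza reads from kafka"));
    }
}
```

The per-partition work shrinks as parallelism grows, but every event is still
copied to every partition, which is where I would expect the network cost to
show up.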

Is there an example of how I might implement this in Flink-Streaming? I
thought perhaps the DataStream#cross transformation would work, but I
haven't worked out how to use it to my purpose. Thus far, I'm using the
Java API.

Thanks a lot!
-n

[0]:
http://samza.apache.org/learn/documentation/0.9/container/state-management.html
[1]:
https://ci.apache.org/projects/flink/flink-docs-release-0.9/api/java/org/apache/flink/api/common/functions/RichFunction.html#open(org.apache.flink.configuration.Configuration)
