Hello, I'm interested in implementing a table/stream join, very similar to what is described in the "Table-stream join" section of the Samza key-value state documentation [0]. Conceptually, this would be an extension of the example provided in the javadocs for RichFunction#open [1], where I have a dataset of searchStrings instead of a single one. As per the Samza explanation, I would like to receive updates to this dataset via an operation log (a la, kafka topic), so as to update my local state while the streaming job runs.
Perhaps you can further advise on parallelization strategy for this operation. It seems to me that I'd want to partition the searchString database across multiple parallelization units and broadcast my input datastream to all those units. The idea being to maximize throughput on available hardware, though I would expect there to be a limit at which the network plane becomes a bottleneck to the broadcast. Is there an example of how I might implement this in Flink-Streaming? I thought perhaps the DataStream#cross transformation would work, but I haven't worked out how to use it to my purpose. Thus far, I'm using the Java API. Thanks a lot! -n [0]: http://samza.apache.org/learn/documentation/0.9/container/state-management.html [1]: https://ci.apache.org/projects/flink/flink-docs-release-0.9/api/java/org/apache/flink/api/common/functions/RichFunction.html#open(org.apache.flink.configuration.Configuration)