Hi Eric,

This sounds like a use case for BroadcastProcessFunction <https://ci.apache.org/projects/flink/flink-docs-stable/api/java/org/apache/flink/streaming/api/functions/co/BroadcastProcessFunction.html>.

You'd use the Cassandra dataset as the source for the broadcast stream, which is distributed to every parallel instance of your custom BroadcastProcessFunction. The input vectors are a partitioned stream that's the other input to this function (via its processElement() method). The two streams get connected as a BroadcastConnectedStream <https://ci.apache.org/projects/flink/flink-docs-stable/api/java/org/apache/flink/streaming/api/datastream/BroadcastConnectedStream.html>.
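
For the per-vector work itself (Euclidean distance against the whole dataset, keeping the n lowest), here is a rough sketch in plain Java with no Flink dependency, so it stands alone. The class and method names (NearestNeighbors, upsert, remove, nearest) are made up for illustration and are not part of Flink. The HashMap is only a stand-in for the broadcast state. I'm also assuming each Cassandra row can be reduced to a String key plus a double[] vector, and that the whole dataset fits in memory on each parallel instance.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical helper for illustration only; not a Flink class.
class NearestNeighbors {

    // Pairs a dataset key with its distance to the current query vector.
    private static final class Scored {
        final String key;
        final double distance;

        Scored(String key, double distance) {
            this.key = key;
            this.distance = distance;
        }
    }

    // Stand-in for the broadcast state: row key -> vector.
    private final Map<String, double[]> dataset = new HashMap<>();

    // Roughly what a broadcast-side update would do when a row is added or changed.
    public void upsert(String key, double[] vector) {
        dataset.put(key, vector.clone());
    }

    // Roughly what a broadcast-side update would do when a row is deleted.
    public void remove(String key) {
        dataset.remove(key);
    }

    public static double distance(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Roughly what the non-broadcast side would do for each input vector:
    // return the keys of the n closest dataset vectors, nearest first.
    public List<String> nearest(double[] query, int n) {
        List<String> result = new ArrayList<>();
        if (n <= 0) {
            return result;
        }
        // Max-heap on distance: the worst of the current best n sits on top,
        // so it can be evicted as soon as the heap grows past n entries.
        PriorityQueue<Scored> heap =
                new PriorityQueue<>((x, y) -> Double.compare(y.distance, x.distance));
        for (Map.Entry<String, double[]> e : dataset.entrySet()) {
            heap.add(new Scored(e.getKey(), distance(query, e.getValue())));
            if (heap.size() > n) {
                heap.poll();
            }
        }
        // The heap drains farthest-first, so reverse to get nearest-first.
        while (!heap.isEmpty()) {
            result.add(heap.poll().key);
        }
        Collections.reverse(result);
        return result;
    }
}
```

In an actual job the upsert/remove logic would live in processBroadcastElement() and the nearest() call in processElement(). Each lookup is a full scan, O(size of dataset * log n), which is probably fine for a slow input stream but is a design choice you'd want to revisit if the dataset gets large.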

Note that as of Flink 1.5 it's also easy to maintain the broadcast state <https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/state/broadcast_state.html>.

-- Ken

> On Nov 14, 2018, at 11:32 PM, eric hoffmann <sigfrid.hoffm...@gmail.com> wrote:
>
> Hi.
> I need to compute an euclidian distance between an input Vector and a full
> dataset stored in Cassandra and keep the n lowest value. The Cassandra
> dataset is evolving (mutable). I could do this on a batch job, but i will
> have to triger it each time and the input are more like a slow stream, but
> the computing need to be fast can i do this on a stream way? is there any
> better solution ?
> Thx

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
Custom big data solutions & training
Flink, Solr, Hadoop, Cascading & Cassandra