Thanks a lot! got it :) On Wed, Nov 21, 2018 at 11:40 PM Jamie Grier <jgr...@lyft.com> wrote:
> Hi Avi, > > The typical approach would be as you've described in #1. #2 is not > necessary -- #1 is already doing basically exactly that. > > -Jamie > > > On Wed, Nov 21, 2018 at 3:36 AM Avi Levi <avi.l...@bluevoyant.com> wrote: > >> Hi , >> I am very new to flink so please be gentle :) >> >> *The challenge:* >> I have a road sensor that should scan billons of cars per day. for >> starter I want to recognise if each car that passes by is new or not. new >> cars (never been seen before by that sensor ) will be placed on a different >> topic on kafka than the other (total of two topics for new and old) . >> under the assumption that the state will contain billions of unique car >> ids. >> >> *Suggested Solutions* >> My question is it which approach is better. >> Both approaches using RocksDB >> >> 1. use the ValueState and to split the steam like >> *val domainsSrc = env* >> * .addSource(consumer)* >> * .keyBy(car => car.id <http://car.id>)* >> * .map(...)* >> and checking if the state value is null to recognise new cars. if new >> than I will update the state >> how will the persistent data will be shard among the nodes in the cluster >> (let's say that I have 10 nodes) ? >> >> 2. use MapState and to partition the stream to groups by some arbitrary >> factor e.g >> *val domainsSrc = env* >> * .addSource(consumer)* >> * .keyBy{ car =>* >> * val h car.id.hashCode % partitionFactor* >> * math.abs(h)* >> * } .map(...)* >> and to check *mapState.keys.contains(car.id <http://car.id>) *if not - >> add it to the state >> >> which approach is better ? >> >> Thanks in advance >> Avi >> >