Hi folks, I am coding a streaming task that processes HTTP requests from our website and enriches them with additional information.
The enrichment uses a lookup table containing session IDs from historic requests and the emails that were used within those sessions in the past:

lookup hashtable: session_id: String => emails: Set[String]

During processing of each NEW http request:
- the lookup table should be used to fetch previous emails and enrich the current stream item
- new candidates for the lookup table will be discovered while processing these items and should be added to the lookup table (these changes should also be visible across the cluster)

I see at least the following issues:
(1) Loading the state as a whole from the data store into memory burns a huge amount of memory (and making changes visible cluster-wide is also an issue).
(2) Not loading it into memory but using something like Cassandra/Redis as a lookup store would certainly work, but introduces a lot of network requests. (Possible ideas: use a distributed cache? Broadcast updates within the Flink cluster?)
(3) How should I integrate the changes to the table with Flink's checkpointing?

I really don't get how to solve this best, and my current solution is far from elegant... So is there any best practice for supporting "large lookup tables that change during stream processing"?

Cheers,
Peter