Hi All, I am experimenting with the Spark 2.3.0 stream-stream join feature to see whether I can leverage it to replace some of our existing services.
Imagine I have 3 worker nodes, each with 16 GB RAM and a 100 GB SSD. My input dataset in Kafka is about 250 GB per day. I want to do a stream-stream join across 8 DataFrames with a 24-hour watermark on a tumbling window, so I need to hold state for 24 hours, after which all of it can be cleared (a sketch of what I have in mind is below). Questions:

1) What happens if the state doesn't fit in memory while doing the stream-stream join?
2) What StorageLevel should I choose here for near-optimal performance?
3) Any other suggestions?

Thanks!
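For concreteness, here is a minimal sketch of the kind of watermarked stream-stream join I mean, shown for just two of the streams. The broker address, topic names, and join keys are placeholders; the real query would chain further joins for the remaining DataFrames:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

val spark = SparkSession.builder.appName("StreamStreamJoin").getOrCreate()

// Left stream: placeholder topic/key names.
val left = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "topicA")
  .load()
  .selectExpr("CAST(key AS STRING) AS joinKey", "timestamp AS leftTime")
  .withWatermark("leftTime", "24 hours")

// Right stream: distinct column names to keep the join condition unambiguous.
val right = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "topicB")
  .load()
  .selectExpr("CAST(key AS STRING) AS joinKey2", "timestamp AS rightTime")
  .withWatermark("rightTime", "24 hours")

// Inner join with a time-range condition, so Spark can drop state
// for rows that fall more than 24 hours outside the watermark.
val joined = left.join(
  right,
  expr("""
    joinKey = joinKey2 AND
    rightTime >= leftTime AND
    rightTime <= leftTime + interval 24 hours
  """))

joined.writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/checkpoints/join")
  .start()
  .awaitTermination()
```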