I am experimenting with Spark 2.3.0 stream-stream join feature to see if I
can leverage it to replace some of our existing services.
Imagine I have 3 worker nodes with *each node* having (16GB RAM and 100GB
SSD). My input dataset which is in Kafka is about 250GB per day. Now I want
to do a stream-stream join across 8 data frames with a watermark set to 24
hours of Tumbling window. so, I need to hold state for 24 hours and then I
can clear all the data.
1) What happens if I can't fit data into memory while doing stream-stream
2) What Storage Level should I choose here for near optimal performance?
3) Any other suggestions?