I just finished watching a great presentation from a recent Spark Summit on real-time movie recommendations using Spark: https://spark-summit.org/east-2015/talk/real-time-recommendations-using-spark . For the purposes of email I am going to really simplify what they did. In general, their real-time system takes in data about what all users are watching and calculates the most popular/trending shows. The results are stored in a database. When an individual user goes to "movie guide", the top 10 recommendations are read from that database.
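To make the simplified flow concrete, here is a toy sketch in plain Python (no Spark; a dict stands in for the database, and all names are invented): the streaming job counts plays and writes the top N, and the "movie guide" path is just a cheap read.

```python
from collections import Counter

# Stand-in "database" that the serving tier reads from
recommendation_db = {}

def update_trending(watch_events, top_n=10):
    """Simulate the streaming job: count views per show and
    store the top-N trending shows in the database."""
    counts = Counter(show for _user, show in watch_events)
    recommendation_db["trending"] = [show for show, _ in counts.most_common(top_n)]

def get_recommendations():
    """Simulate the 'movie guide' page: a plain database read,
    with no Spark on the request path."""
    return recommendation_db.get("trending", [])

events = [("u1", "show-a"), ("u2", "show-a"), ("u3", "show-b"), ("u1", "show-c")]
update_trending(events)
print(get_recommendations())  # show-a first, since it has the most views
```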
My guess is that the part of their system that serves up recommendations to users in real time is not implemented using Spark. It's probably a bunch of REST servers sitting behind proxy servers and load balancers, and the REST servers read the recommendations calculated by Spark Streaming.

This got me thinking. In general we have Spark handling batch work and ingestion of real-time data, but not the part of the system that delivers the real-time user experience. Ideally I would like one unified platform. Using Spark Streaming with a small window size of, say, 100 ms would meet my SLA. Each window is going to contain many unrelated requests. In the recommender-system example, map() would look up the user-specific recommendations for each request. The trick is how to return the response to the correct "client". I could publish the response to some other system (Kafka? or a custom proxy?) that can actually return the data to the client. Is this a good idea? What do people do in practice? Also, I assume I would have to use rdd.foreach() to somehow cause the response data to be sent to the correct client.

Comments and suggestions appreciated.

Kind regards
Andy
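For what it's worth, here is a toy sketch of the routing question in plain Python (no Spark; in-memory queues stand in for Kafka topics or the custom proxy, and all names are invented): each request in a micro-batch carries a client id, the map() step attaches the user-specific recommendations, and the foreach() step delivers each response to that client's reply channel.

```python
import queue

# Made-up per-user recommendations, as the streaming job would have computed them
recommendations = {"alice": ["show-a"], "bob": ["show-b"]}

# One reply channel per connected client; stands in for Kafka or a custom proxy
reply_channels = {cid: queue.Queue() for cid in ("alice", "bob")}

def process_micro_batch(requests):
    """Simulate one small (e.g. 100 ms) window of unrelated requests."""
    # map(): look up the user-specific recommendations for each request
    responses = [(client_id, recommendations.get(client_id, []))
                 for client_id in requests]
    # foreach(): route each response back to the client that asked for it
    for client_id, recs in responses:
        reply_channels[client_id].put(recs)

process_micro_batch(["bob", "alice"])       # one window's worth of requests
print(reply_channels["alice"].get_nowait()) # alice receives only her own recs
```

The key assumption is that every request carries enough identity (a client id or correlation id) for the output side to find the right reply channel; without that, the micro-batch has no way to tell the unrelated requests apart.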