We are relatively new to Spark. So far, during development, we have been manually submitting single ML-training jobs one at a time with spark-submit. Each job accepts a small user-submitted data set and compares it to every data set in our HDFS corpus, which changes only incrementally on a daily basis. (That detail is relevant to question 3 below.)
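To give a sense of the shape of things, here is a simplified sketch of what each job does today. This is not our real code; the class name, paths, and similarity function are all placeholders.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class CorpusCompareJob {

    // Stand-in for our real comparison logic.
    private static double similarity(String record, String userData) {
        return record.equals(userData) ? 1.0 : 0.0;
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("corpus-compare");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // The whole corpus is re-read from HDFS on every submission.
        JavaRDD<String> corpus = sc.textFile("hdfs:///data/corpus/*");

        final String userData = args[0]; // the small user-submitted data set
        JavaRDD<Double> scores =
            corpus.map(record -> similarity(record, userData));

        System.out.println("max score: " + scores.reduce((a, b) -> Math.max(a, b)));
        sc.stop();
    }
}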
Now we are ready to start building out the front-end, which will allow a team of data scientists to submit their problems to the system via a web front-end (the web tier will be Java). Users could of course be submitting jobs more or less simultaneously, and we want to make sure we understand how best to structure this.

Questions:

1 - Does a new SparkContext get created in the web tier for each new processing request?

2 - If so, how long should we expect context setup to take? Our goal is to return a response to users in under 10 seconds, but if creating a new context or otherwise setting up the job takes many seconds, we need to adjust our expectations of what is possible. From using spark-shell one might conclude that startup takes more than 10 seconds, but it's not clear how much of that is context creation versus other things.

3 - (This last question perhaps deserves a post of its own.) Since every job compares some small data structure to the same HDFS corpus, what is the best pattern for caching the RDDs built from HDFS so they don't have to be reconstituted from disk every time? That is, how can RDDs be "shared" from the context of one job to the context of subsequent jobs? Or does something like memcached have to be used? (The P.S. below sketches the kind of arrangement we are imagining.)

Thanks!
David
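P.S. To make questions 1 and 3 concrete, this is roughly the arrangement we are picturing: a single long-lived context owned by the web tier, with the corpus cached once and reused by every request's job. Again, this is only an illustration of the question, not working code; the class name, master URL, paths, and matching logic are all made up.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public final class SharedSpark {

    private static final JavaSparkContext SC;
    private static final JavaRDD<String> CORPUS;

    static {
        SparkConf conf = new SparkConf()
                .setAppName("ml-web-controller")
                .setMaster("spark://master:7077"); // assumption: standalone cluster
        SC = new JavaSparkContext(conf);

        // Pay the HDFS read once; later jobs reuse the in-memory copy.
        CORPUS = SC.textFile("hdfs:///data/corpus/*").cache();
        CORPUS.count(); // force materialization at startup, not on first request
    }

    private SharedSpark() {}

    // Called from a servlet/controller per request; each call runs as a
    // new job on the shared, already-warm context and cached corpus.
    public static long matches(final String userData) {
        return CORPUS.filter(record -> record.contains(userData)).count();
    }
}

If something along these lines is sound, we assume the daily incremental corpus change could be handled by unpersisting and re-caching the RDD, but we are not sure whether that is the right pattern either.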