Hi everyone,

I am currently working on parallelizing a machine learning algorithm on a Microsoft HDInsight cluster. I tried running the algorithm on Hadoop MapReduce, but because the algorithm is iterative, the job scheduling and data loading overhead of launching a new MapReduce job for every iteration severely limits its performance in terms of training time.
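To illustrate the kind of structure I mean, here is a toy sketch (my own simplification in plain Python, not my actual algorithm) of an iterative training loop. The point is that the training data is loaded once and reused on every pass, whereas on MapReduce each pass becomes a separate job that re-reads the input from HDFS:

```python
# Toy gradient-descent loop illustrating the iterative pattern.
# On MapReduce, each pass of this loop would be a separately
# scheduled job that reloads the training set, which is exactly
# the overhead I am running into.

def train(data, iterations=100, lr=0.01):
    """Fit a 1-D weight by gradient descent on squared error."""
    w = 0.0
    for _ in range(iterations):  # one MapReduce job per pass
        # Mean gradient of (w*x - y)^2 over the data set.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad           # model update between passes
    return w

# Points on the line y = 3x; the fitted weight converges towards 3.
data = [(x, 3 * x) for x in range(1, 6)]
w = train(data)
```

A framework that keeps the data cached in memory across iterations (as Spark does with RDD caching) avoids paying the load cost on every pass.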
HDInsight recently added support for Hadoop 2 with YARN, which I hoped would let me run Spark jobs, as Spark seems a much better fit for my task. So far, however, I have not been able to find out how to run Apache Spark jobs on an HDInsight cluster.

Remote job submission (which would be my preference) does not seem possible for Spark on HDInsight, as the REST endpoints for Oozie and Templeton do not appear to support submitting Spark jobs. I also tried RDP-ing to the headnode to submit jobs from there. On the headnode drives I can find other new YARN computation models such as Tez, and I even managed to run Tez jobs through YARN. Spark, however, seems to be missing.

Does this mean that HDInsight currently does not support Spark, even though it supports Hadoop versions with YARN? Do I need to install Spark on the HDInsight cluster myself in some way? Or is there something else I'm missing that would let me run Spark jobs on HDInsight another way?

Many thanks in advance!

Kind regards,
Niek Tax
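P.S. For reference, this is roughly the submission I was hoping to be able to run from the headnode. The `spark-submit` flags follow the Spark 1.x documentation; the class name, jar, and HDFS path are placeholders of mine:

```shell
# Hypothetical submission on a Spark-on-YARN setup; all names below
# are placeholders, not from my actual cluster.
spark-submit \
  --class com.example.MyIterativeAlgorithm \
  --master yarn-cluster \
  --num-executors 4 \
  --executor-memory 2g \
  my-algorithm.jar hdfs:///data/training-set
```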