Hi everyone,

I am currently working on parallelizing a machine learning algorithm on a
Microsoft HDInsight cluster. I tried running the algorithm on Hadoop
MapReduce, but since the algorithm is iterative, the per-job scheduling
overhead and the repeated loading of data from disk severely limit its
performance in terms of training time.

HDInsight recently added support for Hadoop 2 with YARN, which I thought
would allow me to run Spark jobs, which seem a better fit for my task. So
far, however, I have not been able to find out how to run Apache Spark jobs
on an HDInsight cluster.

Remote job submission (which would be my preference) does not seem to be
possible for Spark on HDInsight, as the REST endpoints for Oozie and
Templeton (WebHCat) do not appear to support submitting Spark jobs. I also
tried RDPing into the headnode to submit jobs from there. On the headnode's
drives I can find other new YARN computation models such as Tez, and I
managed to run Tez jobs through YARN. Spark, however, seems to be missing.
Does this mean that HDInsight currently does not support Spark, even though
it supports Hadoop versions with YARN? Do I need to first install Spark on
the HDInsight cluster myself in some way? Or is there something else I'm
missing that would let me run Spark jobs on HDInsight another way?
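For concreteness, this is the kind of submission I was hoping to do from the
headnode. It is only a hypothetical sketch, assuming Spark were installed and
configured for YARN; SPARK_HOME, the jar path, the class name, and the HDFS
path below are placeholders I made up, not anything that actually exists on
an HDInsight node:

```shell
# Hedged sketch: standard Spark-on-YARN submission, IF Spark were present.
# Everything below ($SPARK_HOME, com.example.IterativeTrainer, the jar and
# data paths) is a placeholder for illustration only.
$SPARK_HOME/bin/spark-submit \
  --master yarn-cluster \
  --class com.example.IterativeTrainer \
  --num-executors 4 \
  --executor-memory 2G \
  /path/to/my-algorithm.jar hdfs:///data/training-set
```

If something along these lines is possible on HDInsight, or there is an
equivalent remote-submission endpoint, that would solve my problem.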

Many thanks in advance!


Kind regards,

Niek Tax
