If CDH4.2.0 is used in the EC2 scripts, then that's probably fine as it is.
Users will use our product, which depends on Spark built against CDH4. Our
build system makes it infeasible to support pluggable Hadoop versions - we
declare the dependency on a specific version of Hadoop in our project's Ivy
XML file. We also want to give users the EC2 scripts as an easy way to get
started with setting up a Spark cluster. The EC2 scripts would ideally set
up the cluster with CDH4, the version of Hadoop that our version of the
product is built against.

-Matt Cheah

On 11/26/13 12:10 PM, "Patrick Wendell" <[email protected]> wrote:

>> Is it possible for the spark EC2 scripts to deploy clusters set up with
>> Cloudera's CDH4 hadoop distribution, as opposed to the default hadoop
>> distributions?
>
> No, this isn't possible. The EC2 scripts are designed to launch a cluster
> from scratch with a specific configuration. However, if you've already
> gotten a CDH4 cluster running, then you can just install Spark on that
> cluster.
>
>> As well, if an existing cluster is running Hadoop with CDH4, and Spark
>> is compiled against the (non-Cloudera) Hadoop 2 to run the Spark daemons
>> on that cluster, will there be any problems getting Spark to communicate
>> with HDFS?
>
> You'll need to compile Spark against CDH4 if you want that instance of
> Spark to communicate with the HDFS version in CDH4.
>
>> I'm fairly new to Hadoop versioning, and my team is trying to plan out
>> our deployment strategy. We want our users to be able to use existing
>> clusters backed by CDH4, or allow them to easily spawn clusters with the
>> spark-ec2 scripts, but we want Spark to be built against the same Hadoop
>> jars in both cases.
>
> This isn't possible because of how Hadoop versioning works. A given
> compiled version of Spark can only work against one version of
> Hadoop/HDFS.
> The only caveat is that right now we actually _do_ use CDH4.2.0 in the
> EC2 scripts if you set hadoop-major-version to 2, so it might
> coincidentally work if you happen to be running that exact version.
> But in general, the EC2 scripts can't launch arbitrary versions of
> Hadoop.
>
> Is there a reason you need to run the exact same Spark binary in both
> cases? The most obvious way to facilitate your use case would be to
> run separate Spark binaries on your CDH4 clusters and the EC2 clusters
> used by the users.
>
>> Thanks,
>>
>> -Matt Cheah
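For reference, the two paths discussed in the thread can be sketched as shell commands. This is a rough sketch, not taken from the thread itself: the Hadoop version string, key name, and cluster name below are assumptions, and the real commands must be run from a Spark source checkout (for the build) or the ec2/ directory (for the launch), so they are only echoed here.

```shell
# Sketch of the two setups discussed above. Version string, key name, and
# cluster name are assumptions, not from the thread.

# 1. Build Spark against CDH4's Hadoop for use on existing CDH4 clusters.
#    From a Spark source checkout, something like:
#      SPARK_HADOOP_VERSION=2.0.0-mr1-cdh4.2.0 sbt/sbt assembly
CDH4_BUILD="SPARK_HADOOP_VERSION=2.0.0-mr1-cdh4.2.0 sbt/sbt assembly"

# 2. Launch an EC2 cluster with the Hadoop 2 setup, which per the thread
#    happens to install CDH4.2.0 at the moment:
#      ./spark-ec2 -k mykey -i mykey.pem --hadoop-major-version=2 launch my-cluster
EC2_LAUNCH="./spark-ec2 -k mykey -i mykey.pem --hadoop-major-version=2 launch my-cluster"

echo "$CDH4_BUILD"
echo "$EC2_LAUNCH"
```

As the thread notes, the coincidence in step 2 is fragile: the EC2 scripts pin one Hadoop version per major-version setting, so a product built against a different CDH4 point release would still need its own Spark build.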
