> Is it possible for the spark EC2 scripts to deploy clusters set up with
> Cloudera's CDH4 hadoop distribution, as opposed to the default hadoop
> distributions?
No, this isn't possible. The EC2 scripts are designed to launch a cluster from scratch with a specific configuration. However, if you've already got a CDH4 cluster running, then you can just install Spark on that cluster.

> As well, if an existing cluster is running Hadoop with CDH4, and Spark is
> compiled against the (non-Cloudera) Hadoop 2 to run the Spark daemons on
> that cluster, will there be any problems getting Spark to communicate with
> HDFS?

You'll need to compile Spark against CDH4 if you want that instance of Spark to communicate with the HDFS version in CDH4.

> I'm fairly new to Hadoop versioning, and my team is trying to plan out our
> deployment strategy. We want our users to be able to use existing clusters
> backed by CDH4, or allow them to easily spawn clusters with the spark-ec2
> scripts – but we want Spark to be built against the same Hadoop jars in both
> cases.

This isn't possible because of how Hadoop versioning works. A given compiled version of Spark can only work against one version of Hadoop/HDFS. The only caveat is that right now we actually _do_ use CDH4.2.0 in the EC2 scripts if you set hadoop-major-version to 2, so it might coincidentally work if you happen to be running that exact version. But in general, the EC2 scripts can't launch arbitrary versions of Hadoop.

Is there a reason you need to run the exact same Spark binary in both cases? The most obvious way to facilitate your use case would be to run separate Spark binaries on your CDH4 clusters and the EC2 clusters used by the users.

> Thanks,
>
> -Matt Cheah
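For reference, the two setups discussed above can be sketched roughly as follows. This is a hedged example, not a definitive recipe: the CDH4 artifact string (`2.0.0-mr1-cdh4.2.0`), the key name, and the cluster name are placeholders, and the exact variable and flag names may differ between Spark releases, so check the build and EC2 documentation for the version you're running.

```shell
# Sketch 1: build Spark against the CDH4 Hadoop client libraries so it can
# talk to a CDH4 HDFS. The version string is an assumption; match it to
# whatever your CDH4 cluster is actually running.
SPARK_HADOOP_VERSION=2.0.0-mr1-cdh4.2.0 sbt/sbt assembly

# Sketch 2: launch an EC2 cluster with the Hadoop 2 setup mentioned above
# (which currently happens to use CDH4.2.0). Key pair and cluster name are
# placeholders.
./spark-ec2 -k mykey -i mykey.pem --hadoop-major-version=2 launch my-cluster
```

The point of the two sketches is the caveat in the reply: the binary produced by the first command only coincidentally matches the HDFS launched by the second, because both happen to use CDH4.2.0 at the moment.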
