> Is it possible for the spark EC2 scripts to deploy clusters set up with
> Cloudera's CDH4 hadoop distribution, as opposed to the default hadoop
> distributions?

No, this isn't possible. The EC2 scripts are designed to launch a cluster from
scratch with a specific configuration. However, if you already have a
CDH4 cluster running, you can simply install Spark on that cluster.
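
If you go that route, installing Spark is mostly a matter of putting a
build on the cluster nodes. A minimal sketch (the paths and version
number are illustrative placeholders, not from this thread):

```shell
# Illustrative only: unpack a Spark distribution that was built against
# your cluster's Hadoop version onto a node (or a shared path).
tar xzf spark-0.8.0-incubating.tgz -C /opt
export SPARK_HOME=/opt/spark-0.8.0-incubating

# Jobs then reach the cluster's HDFS simply by using hdfs:// URLs.
```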

> As well, if an existing cluster is running Hadoop with CDH4, and Spark is
> compiled against the (non-Cloudera) Hadoop 2 to run the Spark daemons on
> that cluster, will there be any problems getting Spark to communicate with
> HDFS?

You'll need to compile Spark against CDH4 if you want that instance of
Spark to communicate with the HDFS version that ships in CDH4.
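
A sketch of what that build might look like, assuming a Spark version
that lets you pick the Hadoop dependency via the `SPARK_HADOOP_VERSION`
build variable (check your Spark version's build docs for the exact
mechanism, and match the version string to your actual CDH release):

```shell
# Build Spark against CDH4's Hadoop artifacts. The version string below
# is an example for CDH4.2.0 with MR1; adjust it to your cluster.
SPARK_HADOOP_VERSION=2.0.0-mr1-cdh4.2.0 sbt/sbt assembly
```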
>
> I'm fairly new to Hadoop versioning, and my team is trying to plan out our
> deployment strategy. We want our users to be able to use existing clusters
> backed by CDH4, or allow them to easily spawn clusters with the spark-ec2
> scripts – but we want Spark to be built against the same Hadoop jars in both
> cases.

This isn't possible because Hadoop versions are not wire-compatible with
one another: a given compiled version of Spark can only talk to the one
version of Hadoop/HDFS it was built against.

The only caveat is that right now we actually _do_ use CDH4.2.0 in the
EC2 scripts if you set hadoop-major-version to 2, so it might
coincidentally work if you happen to be running that exact version.
But in general, the EC2 scripts can't launch arbitrary versions of
Hadoop.
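
For reference, a launch along these lines would pick up that CDH4.2.0
build (the key pair, cluster size, and cluster name below are
placeholders, not values from this thread):

```shell
# Illustrative spark-ec2 invocation; --hadoop-major-version=2 selects
# the Hadoop 2 (CDH4.2.0) build mentioned above.
./spark-ec2 -k my-keypair -i my-keypair.pem -s 5 \
  --hadoop-major-version=2 launch my-spark-cluster
```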

Is there a reason you need to run the exact same Spark binary in both
cases? The simplest way to support your use case would be to run
separate Spark builds on your CDH4 clusters and on the EC2 clusters
your users launch.

> Thanks,
>
> -Matt Cheah
